BBM406 Fundamentals of Machine Learning Lecture 6: Learning theory - PowerPoint PPT Presentation

BBM406 Fundamentals of   Machine Learning Lecture 6: Learning theory Probability Review Aykut Erdem // Hacettepe University // Fall 2019

Last time… Regularization , Cross-Validation error Validation error the data Training error number of base functions 50 NN classifier 5-NN classifier Underfitting Just Right Overfitting • large training • small training • small training error error error • large • small • large validation validation validation error error error Figure credit: Fei-Fei Li, Andrej Karpathy, Justin Johnson 2

Today • Learning Theory • Probability Review 3

Learning Theory:   Why ML Works 4

Computational Learning   Theory • Entire subfield devoted to the   ( mathematical analysis of machine   learning algorithms • Has led to several practical methods: − PAC (probably approximately correct) learning   → boosting − VC (Vapnik–Chervonenkis) theory   → support vector machines   slide by Eric Eaton Annual conference: Conference on Learning Theory (COLT) 5

The Role of Theory • Theory can serve two roles: − It can justify and help understand why theory after common practice works. − It can also serve to suggest new algorithms and approaches that turn out to work well in theory before practice. adapted from Hal Daume III Often, it turns out to be a mix! 6

The Role of Theory • Practitioners discover something that works surprisingly well. • Theorists figure out why it works and prove something about it. − In the process, they make it better or find new algorithms. • Theory can also help you understand what’s adapted from Hal Daume III possible and what’s not possible. 7

Learning and Inference The inductive inference process: 1. Observe a phenomenon 2. Construct a model of the phenomenon 3. Make predictions • This is more or less the definition of natural sciences ! • The goal of Machine Learning is to automate   slide by Olivier Bousquet this process • The goal of Learning Theory is to formalize it. 8

Pattern recognition • We consider here the supervised learning framework for pattern recognition: − Data consists of pairs (instance, label) − Label is +1 or − 1 − Algorithm constructs a function (instance → label) − Goal: make few mistakes on future unseen instances slide by Olivier Bousquet 9

Approximation/Interpolation • It is always possible to build a function that fits exactly the data. 1.5 1 0.5 0 0 0.5 1 1.5 • But is it reasonable? 10

    Occam’s Razor • Idea: look for regularities in the observed   phenomenon   These can be generalized from the   observed past to the future   ⇒ choose the simplest consistent model • How to measure simplicity ? − Physics: number of constants − Description length − Number of parameters − ... 11

No Free Lunch • No Free Lunch − if there is no assumption on how the past is related to the future, prediction is impossible − if there is no restriction on the possible phenomena, generalization is impossible • We need to make assumptions • Simplicity is not absolute • Data will never replace knowledge • Generalization = data + knowledge 12

Probably Approximately Correct   (PAC) Learning • A formalism based on the realization that the best we can hope of an algorithm is that − It does a good job most of the time ( probably approximately correct ) adapted from Hal Daume III 13

Probably Approximately Correct   (PAC) Learning • Consider a hypothetical learning algorithm − We have 10 di ff erent binary classification data sets. − For each one, it comes back with functions f 1 , f 2 , . . . , f 10 . ✦ For some reason, whenever you run f 4 on a test point, it crashes your computer. For the other learned functions, their performance on test data is always at most 5% error. ✦ If this situtation is guaranteed to happen, then this hypothetical learning algorithm is a PAC learning algorithm. ✤ It satisfies probably because it only failed in one out of adapted from Hal Daume III ten cases, and it’s approximate because it achieved low, but non-zero, error on the remainder of the cases. 14

PAC Learning Definitions 1 . An algorithm A is an ( e , d ) -PAC learning algorithm if, for all distributions D : given samples from D , the probability that it returns a “bad function” is at most d ; where a “bad” function is one with test error rate more than e on D . adapted from Hal Daume III 15

PAC Learning • Two notions of e ffi ciency − Computational complexity: Prefer an algorithm that runs quickly to one that takes forever − Sample complexity: The number of examples required for your algorithm to achieve its goals Definition: An algorithm A is an efficient ( e , d ) -PAC learning algorithm if it is an ( e , d ) -PAC learning algorithm whose runtime is polynomial in 1 e and 1 d . In other words, suppose that you want your algorithm to achieve adapted from Hal Daume III In other words, to let your algorithm to achieve   4% error rather than 5%, the runtime required   to do so should not go up by an exponential factor! 16

  Example: PAC Learning of Conjunctions • Data points are binary vectors, for instance x = ⟨ 0, 1, 1, 0, 1 ⟩ • Some Boolean conjunction defines the true labeling of this data   (e.g. x 1 ⋀ x 2 ⋀ x 5 ) • There is some distribution D X over binary data points (vectors)   x = ⟨ x 1 , x 2 , . . . , x D ⟩ . • There is a fixed concept conjunction c that we are trying to learn. • There is no noise, so for any example x , its true label is simply   y = c ( x ) • Example: adapted from Hal Daume III y x 1 x 2 x 3 x 4 − Clearly, the true formula cannot   + 1 0 0 1 1 include the terms x 1 , x 2 , ¬ x 3 , ¬ x 4   + 1 0 1 1 1 - 1 1 1 0 1 able 10 . 1 : Data set for learning con- 17

Example: PAC Learning Algorithm 30 B inary C onjunction T rain ( D ) 1 : f ← x 1 ∧ ¬ x 1 ∧ x 2 ∧ ¬ x 2 ∧ · · · ∧ x D ∧ ¬ x D // initialize function 2 : for all positive examples ( x , + 1 ) in D do of Conjunctions for d = 1 . . . D do 3 : if x d = 0 then 4 : f ← f without term “ x d ” 5 : else 6 : y x 1 x 2 x 3 x 4 f ← f without term “ ¬ x d ” 7 : end if + 1 8 : 0 0 1 1 end for 9 : “ Throw Out Bad Terms” 10 : end for + 1 0 1 1 1 11 : return f - 1 1 1 0 1 able 10 . 1 : Data set for learning con- f 0 ( x ) = x 1 ⋀ ¬ x 1 ⋀ x 2 ⋀ ¬ x 2 ⋀ x 3 ⋀ ¬ x 3 ⋀ x 4 ⋀ ¬ x 4 f 1 ( x ) = ¬ x 1 ⋀ ¬ x 2 ⋀ x 3 ⋀ x 4 2 ( x ) = ¬ x 1 ⋀ x 3 ⋀ x 4 f 3 ( x ) = ¬ x 1 ⋀ x 3 ⋀ x 4 f • After processing an example, it is guaranteed to classify that adapted from Hal Daume III example correctly (provided that there is no noise) • Computationally very efficient − Given a data set of N examples in D dimensions, it takes O ( ND ) time to process the data. This is linear in the size of the data set. 18

Example: PAC Learning Algorithm 30 B inary C onjunction T rain ( D ) 1 : f ← x 1 ∧ ¬ x 1 ∧ x 2 ∧ ¬ x 2 ∧ · · · ∧ x D ∧ ¬ x D // initialize function 2 : for all positive examples ( x , + 1 ) in D do of Conjunctions for d = 1 . . . D do 3 : if x d = 0 then 4 : f ← f without term “ x d ” 5 : else 6 : y x 1 x 2 x 3 x 4 f ← f without term “ ¬ x d ” 7 : end if + 1 8 : 0 0 1 1 end for 9 : “ Throw Out Bad Terms” 10 : end for + 1 0 1 1 1 11 : return f - 1 1 1 0 1 able 10 . 1 : Data set for learning con- • Is this an e ffi cient ( ε , δ ) -PAC learning algorithm? • What about sample complexity ? − How many examples N do you need to see in order to guarantee that it achieves an error rate of at most ε (in all but δ - many cases)? most e adapted from Hal Daume III (like 2 2 D / e ) − Perhaps N has to be gigantic (like ) to (probably) guarantee a small error. 19

Vapnik-Chervonenkis   (VC) Dimension • A classic measure of complexity of infinite hypothesis classes based on this intuition. • The VC dimension is a very classification-oriented notion of complexity − The idea is to look at a finite set of unlabeled examples − no matter how these points were labeled, would we be able to find a hypothesis that correctly classifies them • The idea is that as you add more points, being able to represent an arbitrary labeling becomes harder and harder. adapted from Hal Daume III Definitions 2 . For data drawn from some space X , the VC dimension of a hypothesis space H over X is the maximal K such that: there exists a set X ⊆ X of size | X | = K, such that for all binary labelings of X, there exists a function f ∈ H that matches this labeling. 20

How many points can a linear boundary classify exactly? (1-D) • 2 points: Yes! • 3 points: No! slide by David Sontag etc (8 total) VC-dimension = 2 21

    How many points can a linear boundary classify exactly? (2-D) • 3 points: Yes!   • 4 points: No! slide by David Sontag VC-dimension = 3 figure credit: Chris Burges 22

Basic Probability   Review 23

Probability • A is non-deterministic event   – Can think of A as a boolean-valued variable • Examples   – A = your next patient has cancer   – A = Rafael Nadal wins French Open 2019 slide by Dhruv Batra 24

BBM406 Fundamentals of Machine Learning Lecture 6: Learning theory - PowerPoint PPT Presentation

BBM406 Fundamentals of Machine Learning Lecture 6: Learning theory Probability Review Aykut Erdem // Hacettepe University // Fall 2019 Last time Regularization , Cross-Validation error Validation error the data Training error number

BBM406 Fundamentals of Machine Learning Lecture 1: Course outline and logistics An overview

BBM406 Fundamentals of Machine Learning Lecture 23: Dimensionality Reduction Aykut Erdem //

BBM406 Fundamentals of Machine Learning Lecture 18: Decision Trees Aykut Erdem // Hacettepe

BBM406 Fundamentals of Machine Learning Lecture 9: Logistic Regression Discriminative vs.

BBM406 Fundamentals of Machine Learning Lecture 11: Multi-layer Perceptron Forward Pass

BBM406 Fundamentals of Machine Learning Lecture 13: Introduction to Deep Learning Aykut

BBM406 Fundamentals of Machine Learning Lecture 7: Probability Review (contd.) Maximum

BBM406 Fundamentals of Machine Learning Lecture 10: Linear Discriminant Functions Perceptron

BBM406 Fundamentals of Machine Learning Lecture 19: What is Ensemble Learning? Bagging

BBM406 Fundamentals of Machine Learning Lecture 8: Maximum a Posteriori (MAP) Nave Bayes

BBM406 Fundamentals of Machine Learning Lecture 2: Machine Learning by Examples, Nearest

BBM406 Fundamentals of Machine Learning Lecture 20: AdaBoost Aykut Erdem // Hacettepe

BBM406 Fundamentals of Machine Learning Lecture 15: Support Vector Machines Aykut Erdem //

BBM406 Fundamentals of Machine Learning Lecture 17: Kernel Trick for SVMs Risk and Loss

BBM406 Fundamentals of Machine Learning Lecture 12: Computational Graph Backpropagation

BBM406 Fundamentals of Machine Learning Lecture 14: Deep Convolutional Networks Aykut Erdem

Lecture 16 : Independence, Covariance and Correlation of Discrete Random Variables 0/ 31

Political Science 209 - Fall 2018 Probability Florian Hollenbach 26th October 2018 Why

Variance; Continuous Random Variables 18.05 Spring 2014 January 1, 2017 1 / 26 Variance and

Conditional Probability, Independence, Bayes Theorem 18.05 Spring 2014 January 1, 2017 1

A Signal-Processing Framework for Inverse Rendering Ravi Ramamoorthi Pat Hanrahar Computer

Bayesian networks Chapter 14, Sections 14 of; based on AIMA Slides c Artificial Intelligence,

Objectives You should be able to ... Loop Invariants Explain the concept of well formed

Bayes Nets AI Class 10 (Ch. 14.114.4.2; skim 14.3) Weather Cavity Toothache Catch Based on