CS 6316 Machine Learning, Introduction to Learning Theory, Yangfeng Ji - PowerPoint PPT Presentation



slide-1
SLIDE 1

CS 6316 Machine Learning

Introduction to Learning Theory

Yangfeng Ji

Department of Computer Science University of Virginia

slide-2
SLIDE 2

Overview

  • 1. A Toy Example
  • 2. A Formal Model
  • 3. Empirical Risk Minimization
  • 4. Finite Hypothesis Classes
  • 5. PAC Learning
  • 6. Agnostic PAC Learning

1

slide-3
SLIDE 3

Real-world Classification Problem

Image classification 14M images, 20K categories

2

slide-4
SLIDE 4

Real-world Classification Problem (II)

Sentiment classification 192K businesses, 6.6M user reviews

3

slide-5
SLIDE 5

A Toy Example

slide-6
SLIDE 6

Question

Based on the following observations, try to find out the shape and size of the area from which the positive examples come.

(Figure: data points in the (x1, x2) plane; positive examples marked "+".)

  • We have to make certain assumptions; otherwise there is no way to answer this question.

5

slide-7
SLIDE 7

Hypotheses

Given these data points, answer the following two questions:

  • 1. Which shape is the underlying distribution of the red points?
     ◮ A triangle
     ◮ A rectangle
     ◮ A circle
  • 2. What is the size of that shape?

(Figure: data points in the (x1, x2) plane; positive examples marked "+".)
slide-8
SLIDE 8

Basic Concepts (I)

Domain set or input space X: the set of all possible examples

(Figure: data points in the (x1, x2) plane; positive examples marked "+".)

◮ In the example, X = ℝ²
◮ Each point x ∈ X is called an instance.

7

slide-9
SLIDE 9

Basic Concepts (II)

Label set or output space Y: the set of all possible labels

(Figure: data points in the (x1, x2) plane; positive examples marked "+".)

◮ In this toy example, Y = {+, −}
◮ In this course, we often restrict the label set to a two-element set, such as {+1, −1}

8

slide-10
SLIDE 10

Basic Concept (III)

Training set S: a finite sequence of pairs in X × Y, represented as {(x1, y1), (x2, y2), . . . , (xm, ym)} with size m

(Figure: data points in the (x1, x2) plane; positive examples marked "+".)
slide-11
SLIDE 11

Basic Concept: Hypothesis Space

◮ Hypothesis class or hypothesis space H: a set of functions that map instances to labels
◮ Each element h of this hypothesis class is called a hypothesis

Figure: Two hypotheses from the Circle class.

10

slide-12
SLIDE 12

Basic Concept: Hypothesis Space (Cont.)

If we represent a hypothesis by its parameter values, then each hypothesis corresponds to one point in the hypothesis space.

Figure: Visualizing the Circle hypothesis class (axes: center x1, center x2, radius).

11

slide-13
SLIDE 13

Basic Concept: Machine Learners

◮ A (machine) learner is an algorithm A that can find an optimal hypothesis from H based on the training set S
◮ This optimal hypothesis is represented as A(S)

(Figure: the Circle hypothesis space, with axes center x1, center x2, and radius.)

◮ A hypothesis space H is learnable if such an algorithm A exists1

1A precise definition will be provided later in this lecture.

12

slide-14
SLIDE 14

Why a Toy Problem?

With a toy problem, we have the following conveniences that we usually do not have with real-world problems:

◮ no need for data pre-processing
◮ no need for feature engineering
◮ we can make some unrealistic assumptions, e.g.,
  ◮ assume we know the underlying data distribution
  ◮ assume at least one of the classifiers we pick will completely solve the problem

13

slide-15
SLIDE 15

A Formal Model

slide-16
SLIDE 16

Basic Concepts: Summary

◮ Domain set X
◮ Label set Y
◮ Training data S: the observations
◮ Hypothesis class H: e.g., the rectangle class
◮ A learner A: an algorithm that finds an optimal hypothesis

(Figure: data points in the (x1, x2) plane; positive examples marked "+".)
slide-17
SLIDE 17

Data generation process

An idealized process to illustrate the relations among domain set X, label set Y, and the training set S

  • 1. Assume a probability distribution D over the domain set X
  • 2. Sample an instance x ∈ X according to D
  • 3. Annotate it using the labeling function f as y = f(x)

16

slide-18
SLIDE 18

Example

Assume the data distribution D over the domain set X is defined as

    p(x) = (1/2) N(x; 2, 1) + (1/2) N(x; −2, 1)    (1)

where the first term is component 1 and the second is component 2.

The specific data generation process, for each data point:

  • 1. Randomly select a Gaussian component
  • 2. Sample x from the corresponding component
  • 3. Label x based on which component was selected at step 1
     ◮ Component 1: positive
     ◮ Component 2: negative

17
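The three-step generation process above can be sketched in a few lines of Python (a minimal sketch; the function name and the use of the standard library's random module are our own choices):

```python
import random

random.seed(0)  # for reproducibility

def sample_labeled_point():
    """One draw following Eq. (1): pick a Gaussian component uniformly,
    sample x from it, and label x by the chosen component."""
    if random.random() < 0.5:
        return random.gauss(2.0, 1.0), +1   # component 1: N(2, 1) -> positive
    else:
        return random.gauss(-2.0, 1.0), -1  # component 2: N(-2, 1) -> negative

# A training set S of 1K examples, as on the next slide
S = [sample_labeled_point() for _ in range(1000)]
```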

slide-19
SLIDE 19

Example (Cont.)

Figure: 1K examples generated with the previous process.

18

slide-20
SLIDE 20

Measures of success

◮ The error of a classifier is the probability that it does not predict the correct label on a randomly generated instance x
◮ Definition:

    L_(D, f)(h) = P_x∼D[h(x) ≠ f(x)]    (2)

◮ x ∼ D: an instance generated following the distribution D
◮ h(x) ≠ f(x): the prediction of hypothesis h does not match the labeling function's output
◮ L_(D, f)(h): the error of h, measured with respect to D and f

19

slide-21
SLIDE 21

True Error/Risk

Other names (used interchangeably):

◮ the generalization error
◮ the true error
◮ the risk

    L_(D, f)(h) = P_x∼D[h(x) ≠ f(x)]    (3)

20

slide-22
SLIDE 22

Example

Assume we have the data distribution D and the labeling function f as follows:

    p(y = +1) = p(y = −1) = 1/2,  p(x | y = +1) = N(x; 2, 1),  p(x | y = −1) = N(x; −2, 1)    (4)

(Figure: the two class-conditional densities plotted over x.)

Note that p(x) is the same as in the example of the data generation process.

21

slide-24
SLIDE 24

Example (Cont.)

If h is defined as

    h(x) = +1 if p(+1 | x) ≥ p(−1 | x), and −1 otherwise    (5)

then what is L_(D, f)(h) = P_x∼D[h(x) ≠ f(x)]?

(Figure: the two class-conditional densities plotted over x.)

The Bayes predictor: the best predictor if we know the data distribution (more detail will be discussed later)

22
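As a check on this question (the computation below is ours, not from the slides): for this symmetric mixture, h predicts +1 exactly when x ≥ 0, so the true error can be evaluated in closed form with the standard normal CDF:

```python
import math

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# h predicts +1 iff p(+1|x) >= p(-1|x), i.e. iff x >= 0 for this mixture.
# Error = 1/2 * P(x < 0 | y = +1) + 1/2 * P(x >= 0 | y = -1)
#       = 1/2 * Phi(-2) + 1/2 * (1 - Phi(2)) = Phi(-2)
true_error = 0.5 * phi(-2.0) + 0.5 * (1.0 - phi(2.0))
print(round(true_error, 4))  # 0.0228
```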

slide-25
SLIDE 25

Comments

Recall the definition of the true risk with the data distribution D and the labeling function f:

    L_(D, f)(h) = P_x∼D[h(x) ≠ f(x)]    (6)

It is impossible to compute L_(D, f)(h) in practice, since we do not know

◮ the data-generating distribution D
◮ the labeling function f

Alternative option: the empirical risk

23

slide-26
SLIDE 26

Empirical Risk Minimization

slide-27
SLIDE 27

Empirical Risk

The definition of the empirical risk (or empirical error, training error):

    L_S(h) = |{i ∈ [m] : h(x_i) ≠ y_i}| / m    (7)

Explanations:

◮ [m] = {1, 2, . . . , m}, where m is the total number of instances in S
◮ {i ∈ [m] : h(x_i) ≠ y_i}: the set of instances that h predicts wrongly
◮ |{i ∈ [m] : h(x_i) ≠ y_i}|: the size of that set
◮ L_S(h) is defined with respect to the set S

25
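Eq. (7) translates directly into code (a minimal sketch; the function and variable names are ours):

```python
def empirical_risk(h, S):
    """L_S(h) = |{i : h(x_i) != y_i}| / m, as in Eq. (7)."""
    m = len(S)
    return sum(1 for x, y in S if h(x) != y) / m

# A threshold hypothesis on toy 1-D data: predict +1 iff x >= 0.
h = lambda x: +1 if x >= 0 else -1
S = [(1.5, +1), (2.3, +1), (-0.4, +1), (-2.0, -1)]
print(empirical_risk(h, S))  # 1 mistake out of 4 -> 0.25
```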

slide-28
SLIDE 28

Example

Empirical risk is defined on the training set S:  L_S(h) = |{i ∈ [m] : h(x_i) ≠ y_i}| / m    (8)

Figure: 1K examples generated with the previous process.

26

slide-29
SLIDE 29

Empirical Risk Minimization: Definition

Empirical Risk Minimization (ERM): given the training set S and the hypothesis class H,

    h_S ∈ argmin_{h ∈ H} L_S(h)    (9)

◮ argmin stands for the set of hypotheses in H that achieve the minimum value of L_S(h) over H
◮ In general, there is always at least one hypothesis that makes L_S(h) = 0 with an unrealistically large H

27

slide-30
SLIDE 30

Empirical Risk Minimization: Limitation

For example, with an unrealistically large hypothesis class H, we can always minimize the empirical error and make it zero:

    h_S(x) = y_i if (x = x_i) ∧ ((x_i, y_i) ∈ S), and an arbitrary fixed label otherwise    (10)

no matter how many instances are in S.

(Figure: data points in the (x1, x2) plane; positive examples marked "+".)
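The memorizing hypothesis of Eq. (10) can be demonstrated in a few lines (a sketch under our own toy setup; all names are hypothetical, and we use −1 as the arbitrary default label):

```python
import random

def make_memorizer(S):
    """The ERM 'memorizer' of Eq. (10): return the stored label on training
    points and an arbitrary default label (-1 here) everywhere else."""
    table = dict(S)
    return lambda x: table.get(x, -1)

random.seed(0)
# Toy 1-D data whose true rule is simply: +1 iff x >= 0.
def draw():
    x = random.uniform(-1, 1)
    return x, (+1 if x >= 0 else -1)

S = [draw() for _ in range(50)]
h_S = make_memorizer(S)

train_err = sum(h_S(x) != y for x, y in S) / len(S)
test_err = sum(h_S(x) != y for x, y in [draw() for _ in range(1000)]) / 1000
# train_err is 0 by construction; test_err is near 0.5: textbook overfitting.
print(train_err, test_err)
```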
slide-32
SLIDE 32

Overfitting

Although this is just an extreme case, it illustrates an important phenomenon called overfitting.

(Figure: data points in the (x1, x2) plane; positive examples marked "+".)

◮ The performance on the training set is excellent, but the performance on the whole distribution is very poor
◮ We continue this discussion in lecture 6: model selection and validation

29

slide-33
SLIDE 33

Inductive Bias

“A learner that makes no a priori assumptions regarding the identity of the target concept2 has no rational basis for classifying any unseen instances.” [Mitchell, 1997, Page 42]

2labeling function, in the context of our discussion

30

slide-34
SLIDE 34

Finite Hypothesis Classes

slide-36
SLIDE 36

A Learning Problem

Assume we know the following information:

◮ Domain set X = [0, 1]
◮ Distribution D: the uniform distribution over X
◮ Label set Y = {−1, +1}
◮ Labeling function f:

    f(x) = −1 if 0 ≤ x < b;  +1 if b ≤ x ≤ 1    (11)

where b is unknown

The learning problem is defined as:

◮ Given a set of observations S = {(x1, y1), . . . , (xm, ym)}, is there a learning algorithm that can find f (or identify b)?

32

slide-37
SLIDE 37

A Training Set S

Consider the following training sets, each containing 8 data points. Can a learning algorithm find the dividing point? Training set S3

3Please refer to the demo code for more examples.

33

slide-38
SLIDE 38

Finite Hypothesis Class

◮ The finite hypothesis class of dividing points:

    H_f = {h_i : i ∈ [10]}    (12)

with each h_i defined as

    h_i(x) = −1 if 0 ≤ x < i/10;  +1 if i/10 ≤ x ≤ 1    (13)

34

slide-39
SLIDE 39

The Realizability Assumption

The Realizability Assumption: there exists h∗ ∈ H such that L_(D, f)(h∗) = 0 [Shalev-Shwartz and Ben-David, 2014, Definition 2.1]

Comments:

◮ L_(D, f) indicates this is the true error
◮ This assumption implies L_S(h_S) = 0, where L_S is the empirical risk based on the training set S and h_S is the hypothesis found by minimizing the empirical risk on S

35

slide-40
SLIDE 40

A Learning Algorithm

◮ A learner: the brute-force algorithm
  ◮ try the hypotheses one by one and pick the best
  ◮ time complexity O(|H_f|)
◮ Better algorithms exist, such as binary search

36
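The brute-force learner over the class of Eqs. (12)-(13) can be sketched as follows (function names and the demo threshold b = 0.3 are our own choices):

```python
import random

# The finite class H_f of Eqs. (12)-(13): h_i predicts -1 on [0, i/10)
# and +1 on [i/10, 1], for i in {1, ..., 10}.
def h(i, x):
    return -1 if x < i / 10 else +1

def brute_force_erm(S):
    """Try every h_i in turn and return the i minimizing the empirical risk."""
    def L_S(i):
        return sum(h(i, x) != y for x, y in S) / len(S)
    return min(range(1, 11), key=L_S)

# Hypothetical demo: 200 points labeled by the true threshold b = 0.3 (so f = h_3).
random.seed(1)
b = 0.3
S = [(x, -1 if x < b else +1) for x in (random.random() for _ in range(200))]
print(brute_force_erm(S))  # recovers i = 3
```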

slide-41
SLIDE 41

Nonrepresentative Training Set

◮ Consider the following training set (with no negative example)4
◮ Introduce δ ∈ (0, 1) to capture nonrepresentative cases: with probability (1 − δ), we have representative cases
◮ Loosely speaking, in the running example, a representative S has both positive and negative instances
◮ (1 − δ) is called the confidence parameter

4Run the demo code about ten times; you may see this happen once.

37

slide-42
SLIDE 42

Nonperfect Predictors

Consider the following training instances:

◮ Under the realizability assumption, we have L_S(h_S) = 0
◮ But there is no guarantee that L_(D, f)(h_S) = 0
◮ Relax the constraint to

    L_(D, f)(h_S) ≤ ε    (14)

38

slide-43
SLIDE 43

Sample Complexity

◮ In the running example, we used m = 8
◮ Intuitively, if we increase the size of S, we have a better chance of identifying the labeling function f; for example, when m = 691

39

slide-44
SLIDE 44

Summary of the Issues

  • 1. Nonrepresentative training set
     ◮ Missing critical information about the data distribution D
  • 2. Nonperfect predictors
     ◮ L_S(h_S) = 0, but L_D(h_S) > 0
  • 3. Mismatch of the hypothesis space
     ◮ The realizability assumption is unrealistic for practical applications

The first two issues are considered in the PAC learning model, and the last issue is considered in the agnostic PAC learning model.

40

slide-45
SLIDE 45

PAC Learning

slide-46
SLIDE 46

The Realizability Assumption

Let us keep this assumption in this section: there exists h∗ ∈ H such that L_(D, f)(h∗) = 0

Comments:

◮ L_(D, f)(h∗) is the true error
◮ It implies that, with probability 1, every ERM hypothesis satisfies L_S(h_S) = 0
◮ It is a strong assumption made for the purpose of theoretical analysis; in practice, we do not have such a guarantee

42

slide-47
SLIDE 47

An Oversimplified Definition of PAC Learnability

A hypothesis class H is PAC learnable if there exists a learning algorithm with the following property:

◮ for every distribution D over X and
◮ for every labeling function f : X → {0, 1},

with enough training examples, the algorithm returns a hypothesis h such that, with large probability,

    L_(D, f)(h)    (15)

is arbitrarily small.

43

slide-48
SLIDE 48

Distribution D over X

Consider distributions over [0, 1]:

◮ the uniform distribution
◮ the Beta distribution,

    p(x; α, β) = x^(α−1) (1 − x)^(β−1) / B(α, β)    (16)

◮ many other distributions

We expect that, if a learning algorithm A exists, it should work with all of these different distributions.

44

slide-49
SLIDE 49

Labeling Function f : X → {0, 1}

For the problem of finding the dividing point, the labeling function is defined as

    f(x) = −1 if 0 ≤ x < b;  +1 if b ≤ x ≤ 1    (17)

◮ b can be any number here, as long as it satisfies the realizability assumption; in other words, the labeling function is in the hypothesis space, f ∈ H
◮ We will discuss the scenario f ∉ H in the next section

45

slide-50
SLIDE 50

A Simplified Definition of PAC Learnability

A hypothesis class H is PAC learnable if there exists a learning algorithm with the following property:

◮ for every distribution D over X,
◮ for every labeling function f : X → {0, 1}, and
◮ for every ε, δ ∈ (0, 1),

with enough training examples, the algorithm returns a hypothesis h such that, with probability at least 1 − δ,

    L_(D, f)(h) ≤ ε    (18)

46

slide-52
SLIDE 52

Accuracy Parameter ǫ

The accuracy parameter ε determines how far the output classifier may be from the optimal one.

A Simplified Definition: . . . L_(D, f)(h) ≤ ε    (19)

"Approximately Correct"

47

slide-54
SLIDE 54

Confidence Parameter δ

The confidence parameter δ indicates how likely the classifier is to meet the accuracy requirement.

A Simplified Definition: . . . the algorithm returns a hypothesis h such that, with probability at least 1 − δ (over the choice of the examples), L_(D, f)(h) ≤ ε    (20)

"Probably Approximately Correct" (PAC)

48

slide-55
SLIDE 55

Is It Necessary to Have Both Parameters?

Can we remove either ε or δ?

◮ We need δ
  ◮ because the training set is randomly generated and can be nonrepresentative
◮ We need ε
  ◮ because we only have a finite number of training examples, even when the training set is representative

49

slide-56
SLIDE 56

PAC Learnability

A hypothesis class H is PAC learnable if there exist a function m_H : (0, 1)² → ℕ and a learning algorithm with the following property:

◮ for every distribution D over X,
◮ for every labeling function f : X → {0, 1}, and
◮ for every ε, δ ∈ (0, 1),

if the realizability assumption holds with respect to H, D, f, then, when running the learning algorithm on m ≥ m_H(ε, δ) i.i.d. examples generated by D and labeled by f, the algorithm returns a hypothesis h such that, with probability at least 1 − δ,

    L_(D, f)(h) ≤ ε    (21)

50

slide-57
SLIDE 57

Sample Complexity

◮ Sample complexity function: a function of ε and δ,

    m_H(ε, δ) : (0, 1)² → ℕ    (22)

◮ It answers: how many examples are required to guarantee a probably approximately correct solution?
◮ Many functions satisfy the requirement
◮ To be precise, m_H(ε, δ) is defined as the minimal function that satisfies the requirements of PAC learning with ε and δ

51

slide-58
SLIDE 58

Finite Hypothesis Class

Let H be a finite hypothesis class. Let δ ∈ (0, 1) and ε > 0, and let m be an integer that satisfies

    m ≥ log(|H|/δ) / ε    (23)

Then, for any labeling function f and any distribution D for which the realizability assumption holds, with probability 1 − δ over the choice of an i.i.d. sample S of size m, every ERM hypothesis h_S satisfies

    L_(D, f)(h_S) ≤ ε.    (24)

[Shalev-Shwartz and Ben-David, 2014, Corollary 2.3]

52

slide-59
SLIDE 59

Example: Finding the Dividing Points

The sample complexity of a finite hypothesis space:

    m ≥ log(|H|/δ) / ε    (25)

◮ The size of the hypothesis space: |H| = 100
◮ Confidence parameter: δ = 0.1
◮ Accuracy parameter: ε = 0.01

    m_0 = log(|H|/δ) / ε ≈ 691

53
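The arithmetic on this slide is easy to reproduce (the function name is ours; the bound uses the natural logarithm, as in Shalev-Shwartz and Ben-David):

```python
import math

def sample_complexity(H_size, delta, epsilon):
    """m >= log(|H| / delta) / epsilon, the finite-class bound of Eq. (25)."""
    return math.ceil(math.log(H_size / delta) / epsilon)

print(sample_complexity(100, 0.1, 0.01))  # 691
```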

slide-60
SLIDE 60

Agnostic PAC Learning

slide-61
SLIDE 61

Reconsider the Realizability Assumption

The Realizability Assumption: there exists h∗ ∈ H such that

    L_(D, f)(h∗) = P_x∼D[h∗(x) ≠ f(x)] = 0    (26)

Comment: this is a strong assumption

◮ Do we really know f?
◮ Does equation (26) actually hold?

55

slide-62
SLIDE 62

Example: Unrealistic assumption

Image classification 14M images, 20K categories

56

slide-63
SLIDE 63

Notation Revision

◮ Remove the labeling function f from the framework of PAC learning
◮ Modify the definitions
  ◮ Revise D as a joint distribution over X × Y
  ◮ Revise the true risk of a prediction rule h to be

    L_D(h) = P_(x,y)∼D[h(x) ≠ y]    (27)

  ◮ The empirical risk remains the same:

    L_S(h) = |{i ∈ [m] : h(x_i) ≠ y_i}| / m    (28)

◮ No fundamental changes, just for notational convenience
◮ All other things remain the same

57

slide-64
SLIDE 64

Agnostic PAC Learnability

A hypothesis class H is agnostic PAC learnable if there exist a function m_H : (0, 1)² → ℕ and a learning algorithm with the following property:

◮ for every distribution D over X × {−1, +1} and
◮ for every ε, δ ∈ (0, 1),

when running the learning algorithm on m ≥ m_H(ε, δ) i.i.d. examples generated by D, the algorithm returns a hypothesis h such that, with probability at least 1 − δ,

    L_D(h) ≤ min_{h′ ∈ H} L_D(h′) + ε    (29)

58

slide-65
SLIDE 65

Comments

◮ In general, we have

    L_D(h) ≤ min_{h′ ∈ H} L_D(h′) + ε    (30)

◮ If the realizability assumption holds, by definition we have

    min_{h′ ∈ H} L_D(h′) = 0    (31)

and then L_D(h) ≤ min_{h′ ∈ H} L_D(h′) + ε = ε, which is a special case of agnostic PAC learning.

59

slide-68
SLIDE 68

The Bayes Optimal Predictor

If we know the underlying data distribution D, what will be the best hypothesis in agnostic PAC learning?

◮ The Bayes optimal predictor: given a probability distribution D over X × {−1, +1}, the predictor is defined as

    f_D(x) = +1 if P[y = 1 | x] ≥ 1/2, and −1 otherwise    (32)

◮ No other predictor can do better: for any predictor h,

    L_D(f_D) ≤ L_D(h)    (33)

◮ Exercise: show that the Bayes predictor defined in Eq. (32) is optimal

60

slide-70
SLIDE 70

Example

Consider the following data distribution over x ∈ [0, 1]:

    p(x) = (1/2) B(x; 4, 1) + (1/2) B(x; 1, 4)    (34)

where B(x; α, β) is a Beta distribution with parameters α and β; samples from the first component are labeled f(x) = +1 and samples from the second are labeled f(x) = −1.

The true error of the Bayes predictor is L_D(f_D) = 0.0625

61
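The value 0.0625 can be verified numerically (our own sketch): with B(x; 4, 1) = 4x³ and B(x; 1, 4) = 4(1 − x)³, the Bayes error is the integral of the smaller of the two weighted densities:

```python
# Densities of the two Beta components in Eq. (34).
def p_pos(x): return 4 * x ** 3        # B(x; 4, 1), the +1 component
def p_neg(x): return 4 * (1 - x) ** 3  # B(x; 1, 4), the -1 component

# The Bayes predictor picks the class with the larger posterior, so its error
# is the integral over [0, 1] of min(0.5 * p_pos, 0.5 * p_neg).
N = 100_000  # midpoint-rule quadrature cells
bayes_error = sum(
    0.5 * min(p_pos((k + 0.5) / N), p_neg((k + 0.5) / N)) for k in range(N)
) / N
print(round(bayes_error, 4))  # 0.0625, matching the slide
```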

slide-71
SLIDE 71

Example (Cont.)

With 2K training examples, we can find h_S by minimizing the empirical risk L_S(h):

◮ the empirical risk of h_S: L_S(h_S) = 0.0535 (threshold b = 0.4996)
◮ the true risk of h_S: L_D(h_S) = 0.06250018

62

slide-72
SLIDE 72

Reference

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.

Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

63