Selective Prediction: Binary Classification. Rong Zhou. November 8, 2017. PowerPoint presentation.



SLIDE 1

Selective Prediction

Binary classifications

Rong Zhou November 8, 2017

SLIDE 2

Table of contents

  • 1. What are selective classifiers?
  • 2. The Realizable Setting
  • 3. The Noisy Setting

SLIDE 3

What are selective classifiers?

SLIDE 4

Introduction

Selective classifiers are:

  • allowed to reject making predictions without penalty.
  • compelling in applications where wrong classifications are not welcomed and a partial prediction domain is allowed.

SLIDE 5

Introduction

From Hierarchical Concept Learning: A Variation on the Valiant Model [2]: . . . the learner is (instead) supposed to give a program taking instances as input, and having three possible outputs: 1, 0, and “I don’t know”. . . . Informally we call a learning algorithm useful if the program outputs “I don’t know” on at most a fraction ε of all instances . . .

SLIDE 6

What is an ideal selective classifier?

Suppose we are given training examples labelled −1 or 1, and the goal is to design an algorithm to find a good selective classifier.

  • The misclassification rate should not be the only measurement for selective classifiers.
  • A selective classifier with zero misclassification rate can be a very “bad” classifier. Example: a classifier that abstains on every input makes no errors, but it also covers nothing.

SLIDE 7

Notations and Definitions

Consider a selective classifier/predictor C for a binary classification problem with xi ∈ X and yi ∈ {−1, 1}.

  • Coverage (cover(C)): the probability that C predicts a label instead of 0.
  • Error (err(C)): the probability that the true label is the opposite of what C predicts [Note: output 0 is not counted as an error].
  • Risk (risk(C)):

risk(C) = err(C) / cover(C)

An ideal classifier/predictor should have both error and coverage guarantees with high probability (1 − δ).
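These three quantities are easy to estimate empirically. A minimal sketch (not from the slides), where a prediction of 0 means “abstain” and labels are −1 or 1:

```python
def selective_stats(predictions, labels):
    """Return (cover, err, risk) of a selective classifier on paired samples."""
    n = len(predictions)
    covered = sum(1 for p in predictions if p != 0)
    # Abstentions (output 0) are not counted as errors.
    wrong = sum(1 for p, y in zip(predictions, labels) if p != 0 and p != y)
    cover = covered / n          # Pr[C predicts a label instead of 0]
    err = wrong / n              # Pr[true label is the opposite of the prediction]
    risk = err / cover if cover > 0 else 0.0   # risk(C) = err(C) / cover(C)
    return cover, err, risk

# A classifier that always abstains has zero error but also zero coverage:
print(selective_stats([0, 0, 0, 0], [1, -1, 1, 1]))   # (0.0, 0.0, 0.0)
print(selective_stats([1, 0, -1, 1], [1, -1, 1, 1]))  # covers 3/4, one mistake
```

The first call illustrates the “bad” zero-error classifier from the previous slide: err(C) = 0 only because cover(C) = 0.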

SLIDE 8

Forms of selective predictors/classifiers

For a specific sample x:

  • Confidence-rated predictor: [p−1, p0, p1], a distribution over {−1, 0, 1}
  • Selective classifier: (h, γx), where 0 ≤ γx ≤ 1 and h ∈ H; or (h, g(x)), where g(x) ∈ {0, 1} and h ∈ H

SLIDE 9

The Realizable Setting

SLIDE 10

The Realizable Setting

In the realizable setting, our target hypothesis h∗ is in our hypothesis class H, and the labels correspond to what h∗ predicts.

SLIDE 11

An Optimization Problem

We are given:

  • a set of n labelled examples S = {(x1, y1), (x2, y2), . . . , (xn, yn)}
  • a set of m unlabelled examples U = {xn+1, xn+2, . . . , xn+m}
  • a set of hypotheses H

Goal: learn a selective classifier/predictor with an error guarantee ε, and the best possible coverage for the unlabelled examples in U.

SLIDE 12

An Optimization Problem

Confidence-rated predictor: a confidence-rated predictor C is a mapping from U to a set of m distributions over {−1, 0, 1}. For example, if the i-th distribution is [βi, 1 − βi − αi, αi], then

Pr(C(xi) = −1) = βi,  Pr(C(xi) = 1) = αi,  Pr(C(xi) = 0) = 1 − βi − αi

Recall that the version space V is the set of candidate hypotheses in the hypothesis class H.

SLIDE 13

An Optimization Problem

Algorithm 1: Confidence-rated Predictor [1]

1. Inputs: labelled data S, unlabelled data U, error bound ε.
2. Compute the version space V with respect to S.
3. Solve the linear program:

   maximize Σ_{i=1..m} (αi + βi)
   subject to:
     ∀i: αi + βi ≤ 1
     ∀i: αi, βi ≥ 0
     ∀h ∈ V: Σ_{i : h(xn+i) = 1} βi + Σ_{i : h(xn+i) = −1} αi ≤ εm

4. Output the confidence-rated predictor: {[βi, 1 − βi − αi, αi], i = 1, 2, . . . , m}
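A hedged sketch of this linear program using scipy.optimize.linprog (an illustrative implementation, not the authors’ code). The version space is represented here as a finite matrix V_preds of shape (k, m), where V_preds[j][i] ∈ {−1, +1} is hypothesis j’s prediction on unlabelled point i:

```python
import numpy as np
from scipy.optimize import linprog

def confidence_rated_predictor(V_preds, eps):
    V_preds = np.asarray(V_preds)
    k, m = V_preds.shape
    # Decision variables x = [alpha_1..alpha_m, beta_1..beta_m];
    # linprog minimizes, so maximize sum(alpha) + sum(beta) via c = -1.
    c = -np.ones(2 * m)
    # Constraint: alpha_i + beta_i <= 1 for every i.
    A_cap = np.hstack([np.eye(m), np.eye(m)])
    # For each h in V: sum_{i: h(x_i)=1} beta_i + sum_{i: h(x_i)=-1} alpha_i <= eps*m.
    A_err = np.hstack([(V_preds == -1).astype(float),   # alpha coefficients
                       (V_preds == 1).astype(float)])   # beta coefficients
    res = linprog(c,
                  A_ub=np.vstack([A_cap, A_err]),
                  b_ub=np.concatenate([np.ones(m), eps * m * np.ones(k)]),
                  bounds=[(0, None)] * (2 * m))
    alpha, beta = res.x[:m], res.x[m:]
    # Distribution [beta_i, 1 - beta_i - alpha_i, alpha_i] over {-1, 0, 1}.
    return [(b, 1 - a - b, a) for a, b in zip(alpha, beta)]

# With eps = 0 the predictor commits only where every h in V agrees:
# both hypotheses say +1 on point 1, but disagree on point 2.
dists = confidence_rated_predictor([[1, 1], [1, -1]], eps=0.0)
```

Note the finite V_preds matrix is an assumption for illustration; in general V (and hence the constraint set) can be infinite, which is one of the drawbacks discussed later.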

SLIDE 14

An Optimization Problem

Let a selective classifier C be defined by a tuple (h, (γ1, γ2, . . . , γm)), where h ∈ H and 0 ≤ γi ≤ 1 for all i = 1, 2, . . . , m. For any xi, C(xi) = h(xi) with probability γi, and 0 with probability 1 − γi.

SLIDE 15

An Optimization Problem

Algorithm 2: Selective Classifier [1]

1. Inputs: labelled data S, unlabelled data U, error bound ε.
2. Compute the version space V with respect to S. Pick an arbitrary h0 ∈ V.
3. Solve the linear program:

   maximize Σ_{i=1..m} γi
   subject to:
     ∀i: 0 ≤ γi ≤ 1
     ∀h ∈ V: Σ_{i : h(xn+i) ≠ h0(xn+i)} γi ≤ εm

4. Output the selective classifier: (h0, (γ1, γ2, . . . , γm)).
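A similar hedged sketch of Algorithm 2’s linear program (illustrative, not the authors’ code), again representing the version space as a finite prediction matrix: pick any h0 in V, then maximize coverage subject to a disagreement-mass constraint against every h in V.

```python
import numpy as np
from scipy.optimize import linprog

def selective_classifier_weights(V_preds, h0_idx, eps):
    V_preds = np.asarray(V_preds)
    k, m = V_preds.shape
    h0 = V_preds[h0_idx]
    # For each h in V: sum of gamma_i over points where h disagrees with h0 <= eps*m.
    A_ub = (V_preds != h0).astype(float)
    b_ub = eps * m * np.ones(k)
    # Maximize sum(gamma) subject to 0 <= gamma_i <= 1 (linprog minimizes -sum).
    res = linprog(-np.ones(m), A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * m)
    return res.x  # predict h0(x_i) with probability gamma_i, else output 0

# With eps = 0: full weight on the point where all of V agrees, none where it doesn't.
gammas = selective_classifier_weights([[1, 1], [1, -1]], h0_idx=0, eps=0.0)
```

With ε = 0 this reduces to predicting exactly on the agreement region of V, matching the abstain-on-disagreement strategy used later in the deck.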

SLIDE 16

Optimization Problems

Both algorithms can guarantee the ε error bound with optimal/“almost optimal” coverage. Some drawbacks of the optimization algorithms:

  • They only work for those m unlabelled samples.
  • The number of constraints can be infinite.

SLIDE 17

A More General Problem

Now let’s generalize the problem: We are given:

  • a set of n labelled examples S = {(x1, y1), (x2, y2), . . . , (xn, yn)}
  • a set of hypotheses H with VC dimension d

Goal: learn a selective classifier/predictor with zero error over the distribution X and the largest possible coverage with high probability 1 − δ.

SLIDE 18

Notations and Definitions

Let the selective classifier be:

C(x) = (h, g)(x) = h(x) if g(x) = 1, and 0 if g(x) = 0

cover(h, g) = E[g(X)]

Let ĥ be the empirical error minimizer. Define the true error:

errP(h) = Pr(X,Y)∼P(h(X) ≠ Y)

SLIDE 19

Notations and Definitions

With respect to the hypothesis class H, distribution P over X, and real number r > 0, define a true error ball:

V(h, r) = {h′ ∈ H : errP(h′) ≤ errP(h) + r}

and

B(h, r) = {h′ ∈ H : PrX∼P{h′(X) ≠ h(X)} ≤ r}

SLIDE 20

Notations and Definitions

Define the disagreement region of a hypothesis set H:

DIS(H) = {x ∈ X : ∃h1, h2 ∈ H such that h1(x) ≠ h2(x)}

For G ⊆ H, let ∆G denote the volume of the disagreement region. Specifically, ∆G = Pr{DIS(G)}

SLIDE 21

Learning a Selective Classifier

Algorithm 3: Selective Classifier Strategy

1. Inputs: n labelled data S, d, δ.
2. Output: a selective classifier (h, g) such that risk(h, g) = risk(h∗, g).
3. Compute the version space V with respect to S. Pick an arbitrary h0 ∈ V.
4. Set G = V.
5. Construct g such that g(x) = 1 if and only if x ∈ X \ DIS(G).
6. h = h0.
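This strategy can be sketched for a finite hypothesis class (an assumption made here for illustration; the slides state it for general H with VC dimension d). Hypotheses are functions X → {−1, 1}:

```python
def learn_selective(H, S):
    """S: list of (x, y) pairs. Returns (h, g) with g(x)=1 iff x is outside DIS(V)."""
    # Version space: hypotheses consistent with every labelled example.
    V = [h for h in H if all(h(x) == y for x, y in S)]
    h0 = V[0]  # any member of V works, since g only covers the agreement region
    # g(x) = 1 iff all of V agrees on x, i.e. x is not in the disagreement region.
    g = lambda x: int(all(h(x) == h0(x) for h in V))
    return h0, g

# Toy class: threshold classifiers on the line, h_t(x) = 1 iff x >= t.
H = [lambda x, t=t: 1 if x >= t else -1 for t in [0.0, 0.5, 1.0, 1.5]]
S = [(-1.0, -1), (2.0, 1)]   # consistent with all four thresholds
h, g = learn_selective(H, S)
print(g(3.0), g(0.7))        # 1 (all of V agrees), 0 (V disagrees, so abstain)
```

Wherever g(x) = 1, every hypothesis in V, including the target h∗, predicts the same label, which is exactly the argument on the next slide.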

SLIDE 22

Learning a Selective Classifier

Analysis of the strategy: ∀x ∈ X, when g(x) = 1, the target hypothesis h∗ agrees with h

⇒ risk(h, g) = risk(h∗, g)

SLIDE 23

Learning a Selective Classifier

(Thm 2.15: consistent-hypothesis error-rate bound in terms of VC dimension) For any n and δ ∈ (0, 1), with probability at least 1 − δ, every hypothesis h ∈ V has error rate

errP(h) ≤ (4d ln(2n + 1) + 4 ln(4/δ)) / n

Let r = (4d ln(2n + 1) + 4 ln(4/δ)) / n. We know that if h ∈ V, then h ∈ V(h∗, r), so V ⊆ V(h∗, r).

SLIDE 24

Learning a Selective Classifier

Now, if h ∈ V(h∗, r):

E[1{h(X) ≠ h∗(X)}] = E[1{h(X) ≠ Y}] ≤ r

By definition, h ∈ B(h∗, r). Thus, with probability 1 − δ:

V ⊆ V(h∗, r) ⊆ B(h∗, r)  and  ∆V ≤ ∆B(h∗, r)

SLIDE 25

Learning a Selective Classifier

Recall the definition of the disagreement coefficient:

θ = sup_{r>0} ∆B(h∗, r) / r

We have: ∀r ∈ (0, 1), ∆B(h∗, r) ≤ θ · r. Therefore, with probability at least 1 − δ:

∆V ≤ ∆B(h∗, r) ≤ θ · r

cover(h, g) = 1 − ∆V ≥ 1 − θ · r = 1 − θ (4d ln(2n + 1) + 4 ln(4/δ)) / n
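Plugging illustrative numbers into this coverage bound (the values of n, d, δ, and θ below are made up for the example, not from the slides) shows how the guaranteed coverage approaches 1 as the sample size n grows:

```python
import math

def coverage_lower_bound(n, d, delta, theta):
    # r = (4 d ln(2n + 1) + 4 ln(4/delta)) / n, as in the VC bound above.
    r = (4 * d * math.log(2 * n + 1) + 4 * math.log(4 / delta)) / n
    return 1 - theta * r   # cover(h, g) >= 1 - theta * r

# Guaranteed coverage for growing n (fixed d = 10, delta = 0.05, theta = 2).
for n in [1_000, 10_000, 100_000]:
    print(n, round(coverage_lower_bound(n, d=10, delta=0.05, theta=2.0), 4))
```

For small n the bound can even be negative (i.e. vacuous); it only becomes meaningful once n is large relative to d ln n.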

SLIDE 26

The Noisy Setting

SLIDE 27

The Noisy Setting

In the noisy setting, our target hypothesis h∗ is in our hypothesis class H, but the labels correspond to the predictions of h∗ corrupted by noise.

SLIDE 28

Learning a Selective Classifier - the Noisy Setting

Algorithm 4: Selective Classifier Strategy - Noisy [3]

1. Inputs: n labelled data S, d, δ.
2. Output: a selective classifier (h, g) such that risk(h, g) = risk(h∗, g) with probability 1 − δ.
3. Set ĥ = ERM(H, S), i.e. ĥ is any empirical risk minimizer from H.
4. Set G = V̂(ĥ, 4 √( (2d ln(2ne/d) + ln(8/δ)) / n )).
5. Construct g such that g(x) = 1 if and only if x ∈ X \ DIS(G).
6. h = ĥ.

SLIDE 29

Learning a Selective Classifier - the Noisy Setting

Consider a loss function L(ŷ, y).

risk(h, g) = E[L(h(X), Y) · g(X)] / cover(h, g)

Let h∗ be the true risk minimizer. We define the excess loss class as:

F = {L(h(x), y) − L(h∗(x), y) : h ∈ H}

SLIDE 30

Learning a Selective Classifier - the Noisy Setting

Class F is said to be a (β, B)-Bernstein class with respect to P (where 0 ≤ β ≤ 1 and B ≥ 1) if every f ∈ F satisfies

Ef² ≤ B(Ef)^β

SLIDE 31

Learning a Selective Classifier - the Noisy Setting

We will prove the following lemmas to show the error guarantee and the coverage guarantee. [Note: the following proofs take the loss function to be the 0/1 loss.]

  • If F is a (β, B)-Bernstein class with respect to P, then for any r > 0:

    V(h∗, r) ⊆ B(h∗, B r^β)

SLIDE 32

Learning a Selective Classifier - the Noisy Setting

Let

σ(n, δ, d) = 2 √( (2d ln(2ne/d) + ln(2/δ)) / n )

  • For any 0 < δ < 1 and r > 0, with probability at least 1 − δ:

    V̂(ĥ, r) ⊆ V(h∗, 2σ(n, δ/2, d) + r)

SLIDE 33

Learning a Selective Classifier - the Noisy Setting

  • Assume that H has disagreement coefficient θ and that F is a (β, B)-Bernstein class with respect to P. Then for any r > 0 and 0 < δ < 1, with probability at least 1 − δ:

    ∆V̂(ĥ, r) ≤ Bθ (2σ(n, δ/2, d) + r)^β

SLIDE 34

Learning a Selective Classifier - the Noisy Setting

  • Assume that H has disagreement coefficient θ and that F is a (β, B)-Bernstein class with respect to P. Then for any r > 0 and 0 < δ < 1, with probability at least 1 − δ:

    cover(h, g) ≥ 1 − Bθ (2σ(n, δ/2, d) + r)^β  ∧  risk(h, g) = risk(h∗, g)

SLIDE 35

References

[1] Kamalika Chaudhuri and Chicheng Zhang. Improved algorithms for confidence-rated prediction with error guarantees. 2013.
[2] Ronald L. Rivest and Robert Sloan. A formal model of hierarchical concept-learning. Information and Computation, 114(1):88–114, 1994.
[3] Yair Wiener and Ran El-Yaniv. Agnostic selective classification. In Advances in Neural Information Processing Systems, pages 1665–1673, 2011.
