Active Learning for Classification with Abstention

  1. Active Learning for Classification with Abstention. Shubhanshu Shekhar (University of California, San Diego; shshekha@eng.ucsd.edu), Mohammad Ghavamzadeh (Facebook AI Research), and Tara Javidi (University of California, San Diego). ISIT 2020.

  2. Introduction: Classification with Abstention. Feature space X, a joint distribution P_XY on X × {0, 1}, and the augmented label set Ȳ = {0, 1, ∆}, where ∆ is the 'abstain' option. The risk of an abstaining classifier g : X → Ȳ is
    R_λ(g) := P_XY(g(X) ≠ Y, g(X) ≠ ∆) + λ P_X(g(X) = ∆),
  where the first term is the risk from mis-classification and the second is the risk from abstention. The cost of mis-classification is 1, while the abstention cost is λ ∈ (0, 1/2). Goal: given access only to a training sample D_n = {(X_i, Y_i) : 1 ≤ i ≤ n}, construct a classifier g_n with small risk.
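
To make the abstention risk concrete, here is a minimal sketch (not part of the talk) that estimates R_λ(g) on a labelled sample; the abstain marker, the toy threshold classifier, and the simulated data below are illustrative assumptions.

```python
import numpy as np

ABSTAIN = -1  # illustrative marker for the abstain decision ∆

def empirical_risk(g, X, Y, lam):
    """Estimate R_lambda(g) = P(g(X) != Y, g(X) != ∆) + lam * P(g(X) = ∆)."""
    preds = np.array([g(x) for x in X])
    abstain = preds == ABSTAIN
    misclassified = (~abstain) & (preds != Y)
    return misclassified.mean() + lam * abstain.mean()

# Toy usage: eta(x) = x on [0, 1], and a classifier that abstains on ambiguous inputs.
rng = np.random.default_rng(0)
X = rng.uniform(size=1000)
Y = (rng.uniform(size=1000) < X).astype(int)
g = lambda x: 1 if x >= 0.8 else (0 if x <= 0.2 else ABSTAIN)
print(empirical_risk(g, X, Y, lam=0.3))
```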

  3. Introduction: Active and Passive Learning. Passive learning: D_n := {(X_i, Y_i) | 1 ≤ i ≤ n} is generated in an i.i.d. manner. Active learning: the learner constructs D_n by sequentially querying a labelling oracle. Three commonly used active learning models: membership query (request the label at any x ∈ X), stream-based (query or discard points from a stream of inputs X_t ∼ P_X), and pool-based (query points from a given pool of inputs). This talk: membership query. Paper: all three models.
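
In the membership-query model, a labelling oracle can be simulated as a Bernoulli draw from the regression function η; the small class below is an illustrative assumption (not from the paper), shown only to fix the interface available to such a learner.

```python
import numpy as np

class MembershipOracle:
    """Simulated labelling oracle: returns Y ~ Bernoulli(eta(x)) at any queried point x."""
    def __init__(self, eta, seed=0):
        self.eta = eta
        self.rng = np.random.default_rng(seed)

    def query(self, x):
        return int(self.rng.uniform() < self.eta(x))

oracle = MembershipOracle(eta=lambda x: x)  # example regression function eta(x) = x
print(oracle.query(0.9))                    # noisy label at a point chosen by the learner
```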

  4.–6. Prior Work. Prior work has focused mainly on the passive case: Chow (1957) derived the Bayes optimal classifier; Chow (1970) studied the trade-off between the error and abstention rates; Herbei & Wegkamp (2006) analyzed plug-in and ERM classifiers; Bartlett & Wegkamp (2008), Yuan (2010), and Cortes et al. (2016) proposed calibrated convex surrogates; Geifman & El-Yaniv (2017) incorporated abstention into neural networks. Active learning for classification with abstention is largely unexplored.
  Questions:
  1. Performance limits of learning algorithms in both the active and passive settings.
  2. Design of active algorithms that match the performance limit.
  3. Characterization of the performance gain of active over passive learning.
  4. Computationally efficient active algorithms via convex surrogates.
  5. Active learning for neural networks with abstention.

  7. Introduction: Bayes Optimal Classifier. Define the regression function η(x) := P_XY(Y = 1 | X = x). For a fixed cost of abstention λ ∈ (0, 1/2), the Bayes optimal classifier g*_λ is
    g*_λ(x) = 1 if η(x) ≥ 1 − λ,  0 if η(x) ≤ λ,  ∆ if η(x) ∈ (λ, 1 − λ).
  The optimal classifier can equivalently be represented by the triplet (G*_0, G*_1, G*_∆), where G*_v := {x ∈ X | g*_λ(x) = v} for v ∈ {0, 1, ∆}.
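
Since the Bayes rule is a plug-in threshold on η, it is one line of logic; this minimal sketch assumes η(x) is known (which it is not in the learning problem the talk addresses).

```python
ABSTAIN = -1  # illustrative marker for ∆

def bayes_classifier(eta_x, lam):
    """Bayes optimal abstaining decision at a point with regression value eta_x, 0 < lam < 1/2."""
    if eta_x >= 1 - lam:
        return 1
    if eta_x <= lam:
        return 0
    return ABSTAIN

print([bayes_classifier(e, lam=0.3) for e in (0.1, 0.5, 0.9)])  # -> [0, -1, 1]
```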

  8.–9. Introduction: Bayes Optimal Classifier. (Figure slides illustrating the regions G*_0, G*_1, and G*_∆ of the Bayes optimal classifier.)

  10. Assumptions. We make the following two assumptions on the joint distribution P_XY.
  (HÖ) The regression function η is Hölder continuous with parameters L > 0 and 0 < β ≤ 1, i.e., |η(x_1) − η(x_2)| ≤ L d(x_1, x_2)^β for all x_1, x_2 ∈ (X, d).
  (MA) The joint distribution P_XY of the input-label pair satisfies the margin assumption with parameters C_0 > 0 and α_0 ≥ 0: for γ ∈ {λ, 1 − λ}, P_X(|η(X) − γ| ≤ t) ≤ C_0 t^{α_0} for all 0 < t ≤ 1.
  We will use P(α_0, β) to denote the class of distributions P_XY satisfying these conditions.
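
As a sanity check on (MA), the sketch below uses the illustrative choice η(x) = x with P_X uniform on [0, 1]; there the mass within t of either threshold is at most 2t, so (MA) holds with α_0 = 1 and C_0 = 2. These parameter values are examples, not the talk's.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=200_000)   # P_X uniform on [0, 1]
eta = X                         # eta(x) = x
lam, C0, alpha0 = 0.3, 2.0, 1.0

for gamma in (lam, 1 - lam):
    for t in (0.05, 0.1, 0.2):
        mass = np.mean(np.abs(eta - gamma) <= t)  # estimate of P_X(|eta(X) - gamma| <= t)
        print(f"gamma={gamma:.1f} t={t:.2f}  empirical mass={mass:.4f}  C0*t^alpha0={C0 * t ** alpha0:.4f}")
```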

  11.–16. Proposed Algorithm: Outline. Repeat for t = 1, 2, …:
  • Maintain a partition of X into classified and unclassified cells, Q_t^(c) ∪ Q_t^(u).
  • For each cell E ∈ Q_t^(u), compute
    η̂_t(E) = (1 / n_t(E)) Σ_{X_i ∈ E} Y_i,
    u_t(E) = η̂_t(E) + term1 + term2,
    ℓ_t(E) = η̂_t(E) − term1 − term2.
  • If u_t(E) < λ or ℓ_t(E) > 1 − λ, move E to Q_t^(c); otherwise keep it in Q_t^(u).
  • Select the most ambiguous cell E_t = argmax_{E ∈ Q_t^(u)} u_t(E) − ℓ_t(E).
  • Refine E_t if term1 < term2; otherwise query a label at x_t ∼ Unif(E_t).
  (Accompanying figure: the evolving partition of X into cells such as E_1 and E_2.)
  Classifier definition:
    ĝ_n(x) = 1 if ℓ_{t_n}(E_x) > 1 − λ,  0 if u_{t_n}(E_x) < λ,  ∆ otherwise,   (1)
  where E_x denotes the cell containing x. A sketch of this loop in code follows below.
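
The sketch below implements the outline on X = [0, 1] with dyadic interval cells and a simulated label oracle. The slides leave term1 and term2 unspecified; here term1 is a Hoeffding-style confidence width and term2 a Hölder bias bound L·(diam E)^β, chosen as illustrative stand-ins rather than the paper's exact quantities, and all constants are arbitrary.

```python
import numpy as np

ABSTAIN = -1  # illustrative marker for ∆

def run_active(eta, lam, L=1.0, beta=1.0, budget=2000, seed=0):
    """Sketch of the classify/refine/query loop on X = [0, 1] with interval cells."""
    rng = np.random.default_rng(seed)
    unclassified = [{"a": 0.0, "b": 1.0, "n": 0, "s": 0}]   # Q_t^(u): cells [a, b) with label counts
    classified = []                                          # Q_t^(c): cells with a committed decision
    queries = 0

    def bounds(c):
        """Return (l_t(E), u_t(E), term1, term2) for a cell E."""
        term2 = L * (c["b"] - c["a"]) ** beta                # smoothness (bias) term
        if c["n"] == 0:
            return 0.0, 1.0, 1.0, term2
        est = c["s"] / c["n"]                                # eta_hat_t(E)
        term1 = np.sqrt(np.log(4 * (queries + 2)) / (2 * c["n"]))  # confidence width
        return max(0.0, est - term1 - term2), min(1.0, est + term1 + term2), term1, term2

    while queries < budget and unclassified:
        still = []
        for c in unclassified:                               # move confidently labelled cells
            lo, up, _, _ = bounds(c)
            if up < lam:
                c["decision"] = 0; classified.append(c)
            elif lo > 1 - lam:
                c["decision"] = 1; classified.append(c)
            else:
                still.append(c)
        unclassified = still
        if not unclassified:
            break
        cell = max(unclassified, key=lambda c: bounds(c)[1] - bounds(c)[0])  # most ambiguous cell E_t
        _, _, term1, term2 = bounds(cell)
        if term1 < term2:                                    # refine E_t into two children
            unclassified.remove(cell)
            mid = 0.5 * (cell["a"] + cell["b"])
            unclassified += [{"a": cell["a"], "b": mid, "n": 0, "s": 0},
                             {"a": mid, "b": cell["b"], "n": 0, "s": 0}]
        else:                                                # query a label at x_t ~ Unif(E_t)
            x = rng.uniform(cell["a"], cell["b"])
            cell["n"] += 1
            cell["s"] += int(rng.uniform() < eta(x))         # Y ~ Bernoulli(eta(x))
            queries += 1

    def classifier(x):                                       # abstaining classifier in the spirit of (1)
        for c in classified:
            if c["a"] <= x < c["b"]:
                return c["decision"]
        return ABSTAIN                                       # still-unclassified cells abstain
    return classifier

g_hat = run_active(eta=lambda x: x, lam=0.3)
print([g_hat(x) for x in (0.05, 0.5, 0.95)])
```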

  17. (Figure-only slide.)

  18. Bound on Excess Risk. Theorem. Under the margin and smoothness assumptions, and for n large enough, the classifier ĝ_n defined by (1) satisfies
    E[R_λ(ĝ_n) − R_λ(g*_λ)] = Õ(n^{−β(1 + α_0) / (2β + D̃)}).
  Conversely, for any algorithm we have the lower bound
    sup_{P ∈ P(α_0, β)} E[R_λ(g) − R_λ(g*_λ)] = Ω(n^{−β(1 + α_0) / (2β + D)}).
  Here D̃ ≤ D is a measure of the dimensionality of X near the classifier boundaries {x : η(x) ∈ {λ, 1 − λ}}; D̃ depends on the parameters of both the (MA) and (HÖ) assumptions, and there exist cases with D̃ < D as well as cases with D̃ = D.

  19. Lower Bound: Key Inequality. Lemma. Suppose g = (G_0, G_1, G_∆) is a classifier constructed by an algorithm A, and g*_λ = (G*_0, G*_1, G*_∆). Then, for any λ ∈ (0, 1/2), we have
    R_λ(g) − R_λ(g*_λ) ≥ c P_X((G*_∆ \ G_∆) ∪ (G_∆ \ G*_∆))^{(1 + α_0) / α_0},
  where the left-hand side is the excess risk E(A, P_XY, n) and the P_X(·) term defines Ẽ(A, P_XY, n). The rest of the proof relies on a reduction to multiple hypothesis testing. This lemma suggests an appropriate pseudo-metric and motivates the construction of the hypothesis set.

  20. Improvement Over Passive. Corollary. For passive algorithms A_p, we can obtain the lower bound
    inf_{A_p} sup_{P ∈ P(α_0, β)} E[E(A_p, P, n)] = Ω(n^{−β(1 + α_0) / (D + 2β + α_0 β)}).
  This rate is slower than the rate achieved by our active algorithm, ≈ Õ(n^{−β(1 + α_0) / (D + 2β)}), for all D, α_0, β. In particular, the gain due to active learning increases with (i) decreasing D, (ii) increasing α_0, and (iii) increasing β. Under an additional strong density assumption, further improvements are possible.
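
A quick numerical comparison of the two rate exponents, the quantity plotted on the next two slides; the parameter values below are arbitrary examples.

```python
def active_exponent(alpha0, beta, D):
    """Exponent of n in the active rate n^{-beta(1 + alpha0) / (D + 2 beta)}."""
    return beta * (1 + alpha0) / (D + 2 * beta)

def passive_exponent(alpha0, beta, D):
    """Exponent of n in the passive lower bound n^{-beta(1 + alpha0) / (D + 2 beta + alpha0 beta)}."""
    return beta * (1 + alpha0) / (D + 2 * beta + alpha0 * beta)

for D in (1, 5, 20, 50):
    a, p = active_exponent(1.0, 0.5, D), passive_exponent(1.0, 0.5, D)
    print(f"D={D:2d}  active: n^-{a:.3f}  passive: n^-{p:.3f}  gap={a - p:.3f}")
```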

  21. Improvement over passive: α_0 = 1, β = 0.5, D ∈ {1, 2, …, 50}. (Plot of the asymptotic exponent lim_{n→∞} log E / log n versus the dimension D.)

  22. Improvement over passive: α_0 ∈ [1, 10], β = 0.5, D = 5. (Plot of the asymptotic exponent lim_{n→∞} log E / log n versus the margin parameter α_0.)
