Learning from Corrupted Binary Labels via Class-Probability Estimation




  1. Learning from Corrupted Binary Labels via Class-Probability Estimation. Aditya Krishna Menon, Brendan van Rooyen, Cheng Soon Ong, Robert C. Williamson. National ICT Australia and The Australian National University.

  2. Learning from binary labels. [Figure: a sample of instances, each labelled + or −.]

  3. Learning from binary labels. [Figure: the same sample, with a new instance marked ? to be classified.]

  4. Learning from binary labels. [Figure: a sample of instances, each labelled + or −.]

  5. Learning from noisy labels. [Figure: the same sample, with some labels flipped.]

  6. Learning from positive and unlabelled data. [Figure: the same sample, with two instances labelled + and the rest marked ?.]

  7. Learning from binary labels. Nature produces $S \sim D^n$, which is given to the learner. Goal: good classification with respect to the distribution $D$.

  8. Learning from corrupted labels. Nature produces $D$; a corruptor transforms it into $\bar{D}$, and the learner observes $S \sim \bar{D}^n$. Goal: good classification with respect to the (unobserved) distribution $D$.

  9. Paper summary. Can we learn a good classifier from corrupted samples?

  10. Paper summary. Can we learn a good classifier from corrupted samples? Prior work: in special cases (with a rich enough model), yes!

  11. Paper summary. Can we learn a good classifier from corrupted samples? Prior work: in special cases (with a rich enough model), yes — one can treat the samples as if they were uncorrupted! (Elkan and Noto, 2008; Zhang and Lee, 2008; Natarajan et al., 2013; du Plessis and Sugiyama, 2014)

  12. Paper summary. Can we learn a good classifier from corrupted samples? Prior work: in special cases (with a rich enough model), yes — one can treat the samples as if they were uncorrupted! (Elkan and Noto, 2008; Zhang and Lee, 2008; Natarajan et al., 2013; du Plessis and Sugiyama, 2014) This work: a unified treatment via class-probability estimation, with analysis for a general class of corruptions.

  13. Assumed corruption model

  14. Learning from binary labels: distributions. Fix an instance space $\mathcal{X}$ (e.g. $\mathbb{R}^N$). The underlying distribution $D$ over $\mathcal{X} \times \{\pm 1\}$ has constituent components $(P(x), Q(x), \pi) = (\mathbb{P}[X = x \mid Y = 1], \mathbb{P}[X = x \mid Y = -1], \mathbb{P}[Y = 1])$.

  15. Learning from binary labels: distributions. Fix an instance space $\mathcal{X}$ (e.g. $\mathbb{R}^N$). The underlying distribution $D$ over $\mathcal{X} \times \{\pm 1\}$ has constituent components $(P(x), Q(x), \pi) = (\mathbb{P}[X = x \mid Y = 1], \mathbb{P}[X = x \mid Y = -1], \mathbb{P}[Y = 1])$; equivalently, $(M(x), \eta(x)) = (\mathbb{P}[X = x], \mathbb{P}[Y = 1 \mid X = x])$.

  16. Learning from corrupted binary labels. Nature produces $D$; a corruptor transforms it into $\bar{D}$, and the learner receives samples $S \sim \bar{D}^n$ from the corrupted distribution $\bar{D} = (\bar{P}, \bar{Q}, \bar{\pi})$. Goal: good classification with respect to the (unobserved) distribution $D$.

  17. Learning from corrupted binary labels. The learner receives samples $S \sim \bar{D}^n$ from the corrupted distribution $\bar{D} = (\bar{P}, \bar{Q}, \bar{\pi})$, where $\bar{P} = (1 - \alpha) \cdot P + \alpha \cdot Q$, $\bar{Q} = \beta \cdot P + (1 - \beta) \cdot Q$, and $\bar{\pi}$ is arbitrary. Here $\alpha, \beta$ are the noise rates; such $\bar{D}$ are mutually contaminated distributions (Scott et al., 2013). Goal: good classification with respect to the (unobserved) distribution $D$.
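To make the corruption model concrete, here is a minimal sampling sketch (mine, not the paper's; `sample_P` and `sample_Q` are assumed callables that draw one instance from the clean class-conditionals):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_corrupted(sample_P, sample_Q, alpha, beta, pi_bar, n):
    """Draw n points from Dbar: corrupted positives follow
    Pbar = (1 - alpha) * P + alpha * Q, corrupted negatives follow
    Qbar = beta * P + (1 - beta) * Q, with corrupted base rate pi_bar."""
    X, y_bar = [], []
    for _ in range(n):
        if rng.random() < pi_bar:   # corrupted positive: mostly P, alpha share of Q
            x = sample_Q() if rng.random() < alpha else sample_P()
            y_bar.append(+1)
        else:                       # corrupted negative: beta share of P, rest Q
            x = sample_P() if rng.random() < beta else sample_Q()
            y_bar.append(-1)
        X.append(x)
    return np.array(X), np.array(y_bar)

# Example with 1-d Gaussian class-conditionals:
X, y_bar = sample_corrupted(lambda: rng.normal(+1.0), lambda: rng.normal(-1.0),
                            alpha=0.1, beta=0.2, pi_bar=0.5, n=1000)
```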

  18. Special cases.
      Label noise: labels flipped w.p. $\rho$; $\bar{\pi} = (1 - 2\rho) \cdot \pi + \rho$; $\alpha = \bar{\pi}^{-1} \cdot (1 - \pi) \cdot \rho$; $\beta = (1 - \bar{\pi})^{-1} \cdot \pi \cdot \rho$.
      PU learning: observe $M$ instead of $Q$; $\bar{\pi}$ is arbitrary; $\bar{P} = 1 \cdot P + 0 \cdot Q$; $\bar{Q} = M = \pi \cdot P + (1 - \pi) \cdot Q$.
      [Figures: a label-noise sample with some labels flipped, and a PU sample with positives and unlabelled instances.]
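As a small illustration (function names are mine), the identities above map each special case's natural parameters onto the mutual-contamination parameters $(\alpha, \beta, \bar{\pi})$:

```python
def label_noise_params(pi, rho):
    """Symmetric label noise with flip probability rho, expressed as
    mutual contamination via the identities on slide 18."""
    pi_bar = (1 - 2 * rho) * pi + rho
    alpha = (1 - pi) * rho / pi_bar
    beta = pi * rho / (1 - pi_bar)
    return alpha, beta, pi_bar

def pu_params(pi, pi_bar):
    """Case-controlled PU learning: Pbar = P (so alpha = 0) and
    Qbar = M = pi * P + (1 - pi) * Q (so beta = pi); pi_bar is arbitrary."""
    return 0.0, pi, pi_bar
```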

  19. Corrupted class-probabilities. The structure of the corrupted class-probabilities underpins the analysis.

  20. Corrupted class-probabilities. The structure of the corrupted class-probabilities underpins the analysis. Proposition: for any $D$ and $\bar{D}$, $\bar{\eta}(x) = \phi_{\alpha, \beta, \pi}(\eta(x))$, where $\phi_{\alpha, \beta, \pi}$ is strictly monotone for fixed $\alpha, \beta, \pi$.

  21. Corrupted class-probabilities. Proposition: for any $D$ and $\bar{D}$, $\bar{\eta}(x) = \phi_{\alpha, \beta, \pi}(\eta(x))$, where $\phi_{\alpha, \beta, \pi}$ is strictly monotone for fixed $\alpha, \beta, \pi$. This follows from Bayes' rule: $\frac{\bar{\eta}(x)}{1 - \bar{\eta}(x)} = \frac{\bar{\pi}}{1 - \bar{\pi}} \cdot \frac{\bar{P}(x)}{\bar{Q}(x)}$.

  22. Corrupted class-probabilities. Proposition: for any $D$ and $\bar{D}$, $\bar{\eta}(x) = \phi_{\alpha, \beta, \pi}(\eta(x))$, where $\phi_{\alpha, \beta, \pi}$ is strictly monotone for fixed $\alpha, \beta, \pi$. This follows from Bayes' rule: $\frac{\bar{\eta}(x)}{1 - \bar{\eta}(x)} = \frac{\bar{\pi}}{1 - \bar{\pi}} \cdot \frac{\bar{P}(x)}{\bar{Q}(x)} = \frac{\bar{\pi}}{1 - \bar{\pi}} \cdot \frac{(1 - \alpha) \cdot \frac{P(x)}{Q(x)} + \alpha}{\beta \cdot \frac{P(x)}{Q(x)} + (1 - \beta)}$.
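The odds identity gives $\phi$ directly. A sketch (my reconstruction: it assumes $0 < \eta < 1$, and uses the clean base rate $\pi$ to convert $\eta$ into the density ratio $P(x)/Q(x)$, so both base rates appear as arguments):

```python
def phi(eta, alpha, beta, pi, pi_bar):
    """Corrupted class-probability eta_bar = phi(eta), via the odds
    identity on slide 22. Assumes 0 < eta < 1 so all ratios are finite."""
    r = (1 - pi) / pi * eta / (1 - eta)     # density ratio P(x) / Q(x)
    odds = (pi_bar / (1 - pi_bar)
            * ((1 - alpha) * r + alpha) / (beta * r + (1 - beta)))
    return odds / (1 + odds)
```

Strict monotonicity in $\eta$ can be checked numerically by evaluating `phi` on a grid.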

  23. Corrupted class-probabilities: special cases. Label noise: $\bar{\eta}(x) = (1 - 2\rho) \cdot \eta(x) + \rho$, with $\rho$ unknown (Natarajan et al., 2013). PU learning: $\bar{\eta}(x) = \frac{\bar{\pi} \cdot \eta(x)}{\bar{\pi} \cdot \eta(x) + (1 - \bar{\pi}) \cdot \pi}$, with $\pi$ unknown (Ward et al., 2009).
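A quick sanity check (mine, reusing the sketches above) that the general $\phi$ reduces to these two closed forms:

```python
pi, rho, eta = 0.3, 0.2, 0.7

# Label noise: phi should equal (1 - 2*rho) * eta + rho.
alpha, beta, pi_bar = label_noise_params(pi, rho)
assert abs(phi(eta, alpha, beta, pi, pi_bar)
           - ((1 - 2 * rho) * eta + rho)) < 1e-12

# PU learning: phi should equal pi_bar*eta / (pi_bar*eta + (1 - pi_bar)*pi).
alpha, beta, pi_bar = pu_params(pi, pi_bar=0.5)
assert abs(phi(eta, alpha, beta, pi, pi_bar)
           - pi_bar * eta / (pi_bar * eta + (1 - pi_bar) * pi)) < 1e-12
```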

  24. Roadmap. nature → $D$ → corruptor → $\bar{D}$ → class-probability estimator (kernel logistic regression) → $\hat{\eta}$ → classifier.

  25. Roadmap. Exploit the monotone relationship between $\eta$ and $\bar{\eta}$: nature → $D$ → corruptor → $\bar{D}$ → class-probability estimator (kernel logistic regression) → $\hat{\eta}$ → ? → classifier.
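One concrete stand-in for the kernel-logistic-regression box, sketched with scikit-learn (an RBF feature map followed by linear logistic regression approximates KLR; the hyperparameters here are placeholders, not the paper's):

```python
from sklearn.pipeline import make_pipeline
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression

def fit_corrupted_cpe(X, y_bar):
    """Fit a class-probability estimator on the corrupted sample and
    return eta_bar_hat as a callable."""
    model = make_pipeline(Nystroem(gamma=1.0, n_components=100),
                          LogisticRegression(max_iter=1000))
    model.fit(X, y_bar)
    return lambda X_new: model.predict_proba(X_new)[:, 1]  # P(Ybar = +1 | x)
```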

  26. Classification with noise rates

  27. Class-probabilities and classification. Many classification measures are optimised by $\mathrm{sign}(\eta(x) - t)$: 0-1 error → $t = \frac{1}{2}$; balanced error → $t = \pi$; F-score → the optimal $t$ depends on $D$ (Lipton et al., 2014; Koyejo et al., 2014).

  28. Class-probabilities and classification. Many classification measures are optimised by $\mathrm{sign}(\eta(x) - t)$: 0-1 error → $t = \frac{1}{2}$; balanced error → $t = \pi$; F-score → the optimal $t$ depends on $D$ (Lipton et al., 2014; Koyejo et al., 2014). We can relate this to thresholding of $\bar{\eta}$!

  29. Corrupted class-probabilities and classification. By the monotone relationship, $\eta(x) > t \iff \bar{\eta}(x) > \phi_{\alpha, \beta, \pi}(t)$. So thresholding $\bar{\eta}$ at $\phi_{\alpha, \beta, \pi}(t)$ → optimal classification on $D$. This can be translated into a regret bound, e.g. for the 0-1 loss.
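Putting this together, a sketch of the corrected decision rule (reusing the `phi` sketch above; not the paper's reference implementation):

```python
import numpy as np

def corrected_classify(eta_bar_hat, X, t, alpha, beta, pi, pi_bar):
    """Classify by thresholding corrupted class-probability estimates at
    phi(t) rather than t; by monotonicity this matches thresholding of
    the clean eta at t on the distribution D."""
    t_bar = phi(t, alpha, beta, pi, pi_bar)
    return np.where(eta_bar_hat(X) > t_bar, +1, -1)
```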

  30. Story so far. The classification scheme requires: $\bar{\eta}$; $t$; $\alpha, \beta, \pi$. Pipeline: nature → $D$ → corruptor → $\bar{D}$ → class-probability estimator (kernel logistic regression) → $\hat{\eta}$; noise oracle → $\hat{\alpha}, \hat{\beta}, \hat{\pi}$; classifier $\mathrm{sign}(\hat{\eta}(x) - \phi_{\hat{\alpha}, \hat{\beta}, \hat{\pi}}(t))$.

  31. Story so far. The classification scheme requires: $\bar{\eta}$ → class-probability estimation (kernel logistic regression); $t$; $\alpha, \beta, \pi$ (noise oracle). Classifier: $\mathrm{sign}(\hat{\eta}(x) - \phi_{\hat{\alpha}, \hat{\beta}, \hat{\pi}}(t))$.

  32. Story so far. The classification scheme requires: $\bar{\eta}$ → class-probability estimation (kernel logistic regression); $t$ → if unknown, an alternate approach (see poster); $\alpha, \beta, \pi$ (noise oracle). Classifier: $\mathrm{sign}(\hat{\eta}(x) - \phi_{\hat{\alpha}, \hat{\beta}, \hat{\pi}}(t))$.

  33. Story so far. The classification scheme requires: $\bar{\eta}$ → class-probability estimation (kernel logistic regression); $t$ → if unknown, an alternate approach (see poster); $\alpha, \beta, \pi$ → can we estimate these? (Replace the noise oracle with a noise estimator producing $\hat{\alpha}, \hat{\beta}, \hat{\pi}$.)

  34. Estimating noise rates: some bad news. $\pi$ is strongly non-identifiable: $\bar{\pi}$ is allowed to be arbitrary (e.g. PU learning). $\alpha, \beta$ are non-identifiable without assumptions (Scott et al., 2013). Can we estimate $\alpha, \beta$ under assumptions?

  35. Weak separability assumption. Assume that $D$ is "weakly separable": $\min_{x \in \mathcal{X}} \eta(x) = 0$ and $\max_{x \in \mathcal{X}} \eta(x) = 1$, i.e. there exist deterministically positive and deterministically negative instances. This is weaker than full separability.

  36. Weak separability assumption. Assume that $D$ is "weakly separable": $\min_{x \in \mathcal{X}} \eta(x) = 0$ and $\max_{x \in \mathcal{X}} \eta(x) = 1$, i.e. there exist deterministically positive and deterministically negative instances; this is weaker than full separability. The assumed range of $\eta$ constrains the observed range of $\bar{\eta}$!

  37. Estimating noise rates. Proposition: pick any weakly separable $D$. Then, for any $\bar{D}$, $\alpha = \frac{\bar{\eta}_{\min} \cdot (\bar{\eta}_{\max} - \bar{\pi})}{\bar{\pi} \cdot (\bar{\eta}_{\max} - \bar{\eta}_{\min})}$ and $\beta = \frac{(1 - \bar{\eta}_{\max}) \cdot (\bar{\pi} - \bar{\eta}_{\min})}{(1 - \bar{\pi}) \cdot (\bar{\eta}_{\max} - \bar{\eta}_{\min})}$, where $\bar{\eta}_{\min} = \min_{x \in \mathcal{X}} \bar{\eta}(x)$ and $\bar{\eta}_{\max} = \max_{x \in \mathcal{X}} \bar{\eta}(x)$. Thus $\alpha, \beta$ can be estimated from corrupted data alone.
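A plug-in version of this proposition (a sketch; in practice the extremes would be taken over corrupted class-probability estimates $\hat{\eta}$ on a sample):

```python
import numpy as np

def estimate_noise_rates(eta_bar_hats, pi_bar):
    """Plug-in estimates of (alpha, beta) from the range of corrupted
    class-probability estimates, per the proposition on slide 37."""
    eta_min, eta_max = np.min(eta_bar_hats), np.max(eta_bar_hats)
    alpha_hat = (eta_min * (eta_max - pi_bar)
                 / (pi_bar * (eta_max - eta_min)))
    beta_hat = ((1 - eta_max) * (pi_bar - eta_min)
                / ((1 - pi_bar) * (eta_max - eta_min)))
    return alpha_hat, beta_hat
```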

  38. Estimating noise rates: special cases. Label noise: $\rho = 1 - \bar{\eta}_{\max} = \bar{\eta}_{\min}$ and $\pi = \frac{\bar{\pi} - \bar{\eta}_{\min}}{\bar{\eta}_{\max} - \bar{\eta}_{\min}}$. PU learning: $\alpha = 0$ and $\beta = \pi = \frac{1 - \bar{\eta}_{\max}}{\bar{\eta}_{\max}} \cdot \frac{\bar{\pi}}{1 - \bar{\pi}}$. (Elkan and Noto, 2008; Liu and Tao, 2014); c.f. the mixture proportion estimate of (Scott et al., 2013). In these cases, $\pi$ can be estimated as well.
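The corresponding special-case estimators, again as an illustrative sketch:

```python
def label_noise_estimates(eta_min, eta_max, pi_bar):
    """Label noise: flip rate and clean base rate from the range of eta_bar."""
    rho_hat = eta_min                # also equals 1 - eta_max under the model
    pi_hat = (pi_bar - eta_min) / (eta_max - eta_min)
    return rho_hat, pi_hat

def pu_estimate(eta_max, pi_bar):
    """PU learning: beta = pi, recovered from the maximal corrupted
    class-probability (c.f. Scott et al.'s mixture proportion estimate)."""
    return (1 - eta_max) / eta_max * pi_bar / (1 - pi_bar)
```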

  39. Story so far. Optimal classification in general requires $\alpha, \beta, \pi$; these are obtained from a noise estimator that uses the range of $\hat{\eta}$. Pipeline: nature → $D$ → corruptor → $\bar{D}$ → class-probability estimator (kernel logistic regression) → $\hat{\eta}$ → noise estimator (range of $\hat{\eta}$) → $\hat{\alpha}, \hat{\beta}, \hat{\pi}$ → classifier $\mathrm{sign}(\hat{\eta}(x) - \phi_{\hat{\alpha}, \hat{\beta}, \hat{\pi}}(t))$.
