  1. Information, Divergence and Risk for Binary Classification
  Mark Reid* [mark.reid@anu.edu.au]
  Research School of Information Science and Engineering, The Australian National University, Canberra, ACT, Australia
  MLSS.cc Machine Learning Summer School, Thursday, 29th January 2009
  *Joint work with Robert Williamson

  2. Introduction

  3. The Blind Men & The Elephant
  f-Divergence, Statistical Information, Bregman Divergence, AUC, Cost Curves

  4. Overview
  Convex function representations
  • Integral (Taylor's theorem)
  • Variational (LF dual)
  Binary Experiments
  • Distinguishing between two probability distributions or classes
  Classification Problems
  • Distinguishing between two distributions, for each instance
  Measures of Divergence
  • Csiszár and Bregman divergences
  Loss, Risk and Regret
  • Statistical Information
  • Representations
  • Loss and Divergence
  Bounds and Applications
  • Reductions
  • Loss and Pinsker Bounds

  5. What's in it for me?
  What to expect
  • Lots of definitions
  • Various points of view on the same concepts
  • Relationships between those concepts
  • An emphasis on problems over techniques
  What not to expect
  • Algorithms
  • Models
  • Sample complexity analysis
  ‣ Everything is idealised, i.e., assuming complete data
  • Technicalities

  6. Part I: Convexity and Binary Experiments

  7. Overview
  Convex Functions
  • Definitions & Properties
  • Fenchel & Csiszár Duals
  • Taylor Expansion
  • The Jensen Gap
  • Bregman Information
  Binary Experiments and Divergence
  • Definitions & Examples
  • Statistics
  • Neyman-Pearson Lemma
  • Bregman & f-Divergence
  Class Probability Estimation
  • Generative/Discriminative Views
  • Loss, Risk, Regret
  • Savage's Theorem
  • Statistical Information

  8. Convex Functions and their Representations

  9. Convex Sets
  • Given points x_1, ..., x_n and weights λ_1, ..., λ_n ≥ 0 such that Σ_{i=1}^n λ_i = 1, their convex combination is Σ_{i=1}^n λ_i x_i
  • We say S ⊆ R^d is a convex set if it is closed under convex combination. That is, for any n, any x_1, ..., x_n ∈ S, and weights λ_1, ..., λ_n ≥ 0 with Σ_{i=1}^n λ_i = 1, we have Σ_{i=1}^n λ_i x_i ∈ S
  • It suffices to show that for all x_1, x_2 ∈ S and λ ∈ [0, 1], λ x_1 + (1 − λ) x_2 ∈ S
  [Figure: examples of a convex and a non-convex set]

  10. Convex Functions
  • The epigraph of a function f is the set of points that lie above its graph: epi(f) := { (x, y) : x ∈ R^d, y ≥ f(x) }
  • A function is convex if its epigraph is a convex set
  ‣ Lines interpolating any two points on its graph lie above the graph
  ‣ A convex function is necessarily continuous
  ‣ A point-wise sum of convex functions is convex
  [Figure: a convex function f with its epigraph epi(f) shaded]

  11. The Legendre-Fenchel Transform
  • The LF transform generalises the notion of a derivative to non-differentiable functions:
    f*(t*) = sup_{t ∈ R^d} { ⟨t, t*⟩ − f(t) }
  • When f is differentiable at t = (f′)^{−1}(t*):
    f*(t*) = t*·t − f((f′)^{−1}(t*))
  • The double LF transform
    f**(t) = sup_{t* ∈ R^d} { ⟨t*, t⟩ − f*(t*) }
    is involutive for convex f; that is, f**(t) = f(t)
  [Figure: a convex f(t) and its dual f*(t*); slopes of tangents to f correspond to the arguments at which f* is evaluated]
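
Aside (not part of the original slides): a minimal numerical sketch of the 1-d LF transform, approximating the supremum over a grid. It assumes NumPy; the example f(t) = t log t and the grid bounds are illustrative choices, for which the dual is known to be f*(s) = exp(s − 1).

import numpy as np

def lf_transform(f, ts):
    # Approximate f*(t*) = sup_t { t*·t − f(t) } by a maximum over the grid ts.
    f_vals = f(ts)
    def f_star(t_star):
        return np.max(t_star * ts - f_vals)
    return f_star

# Example (assumed): f(t) = t*log(t) on (0, 10]; its dual is f*(s) = exp(s - 1).
ts = np.linspace(1e-6, 10, 100_000)
f = lambda t: t * np.log(t)
f_star = lf_transform(f, ts)
print(f_star(1.0), np.exp(0.0))   # both close to 1.0
print(f_star(0.5), np.exp(-0.5))  # both close to 0.607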

  12. Taylor's Theorem
  Integral Form of the Taylor Expansion
  • Let [t_0, t] be an interval on which f is twice differentiable. Then
    f(t) = f(t_0) + (t − t_0) f′(t_0) + ∫_{t_0}^{t} (t − s) f″(s) ds
  Corollary
  • Let f be twice differentiable on [a, b]. Then, for all t in [a, b],
    f(t) = f(t_0) + (t − t_0) f′(t_0) + ∫_a^b g(t, s) f″(s) ds
    where g(t, s) = (t − s)_+ if s ≥ t_0 and (s − t)_+ if s < t_0
  • The differentiability requirement can be removed if f′ and f″ are interpreted distributionally
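
Aside (not part of the original slides): a small numerical check of the corollary form above, assuming NumPy and SciPy are available; f(t) = t log t, the interval [a, b] = [0.25, 4], and t_0 = 1 are illustrative choices.

import numpy as np
from scipy.integrate import quad

f  = lambda t: t * np.log(t)
f1 = lambda t: np.log(t) + 1.0   # f'
f2 = lambda t: 1.0 / t           # f''

a, b, t0 = 0.25, 4.0, 1.0

def g(t, s):
    # The kernel from the corollary: (t - s)_+ for s >= t0, (s - t)_+ for s < t0.
    return max(t - s, 0.0) if s >= t0 else max(s - t, 0.0)

def taylor(t):
    remainder, _ = quad(lambda s: g(t, s) * f2(s), a, b, points=(t0, t))
    return f(t0) + (t - t0) * f1(t0) + remainder

for t in (0.5, 1.0, 3.0):
    print(f(t), taylor(t))   # the two columns agree up to quadrature error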

  13. Bregman Divergence
  • A Bregman divergence is a general class of "distance" measures defined using a convex function f:
    B_f(t, t_0) := f(t) − f(t_0) − ⟨t − t_0, ∇f(t_0)⟩
  • In the 1-d case, B_f(t, t_0) is the non-linear part of the Taylor expansion of f:
    B_f(t, t_0) = ∫_{t_0}^{t} (t − s) f″(s) ds
  [Figure: B_f(t, t_0) shown as the gap between f(t) = t log(t) and its tangent at t_0]
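
Aside (not part of the original slides): a sketch comparing the 1-d Bregman divergence with its integral (Taylor remainder) form, assuming NumPy and SciPy; f(t) = t log t and the evaluation points are illustrative choices.

import numpy as np
from scipy.integrate import quad

f  = lambda t: t * np.log(t)
f1 = lambda t: np.log(t) + 1.0
f2 = lambda t: 1.0 / t

def bregman(t, t0):
    # Definition: B_f(t, t0) = f(t) - f(t0) - (t - t0) f'(t0).
    return f(t) - f(t0) - (t - t0) * f1(t0)

def bregman_integral(t, t0):
    # Integral form: B_f(t, t0) = ∫_{t0}^{t} (t - s) f''(s) ds.
    value, _ = quad(lambda s: (t - s) * f2(s), t0, t)
    return value

print(bregman(2.0, 0.5), bregman_integral(2.0, 0.5))   # both approximately 1.273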

  14. Jensen's Inequality
  Jensen Gap
  • For convex f : R → R and a distribution P, define J_P[f(x)] := E_P[f(x)] − f(E_P[x])
  Jensen's Inequality
  • The Jensen gap is non-negative for all P if and only if f is convex
  Affine Invariance
  • For all values a, b: J_P[f(x) + bx + a] = J_P[f(x)]
  Taylor Expansion
  • J_P[f(x)] = J_P[ ∫_a^b g_{x_0}(x, s) f″(s) ds ] = ∫_a^b J_P[g_{x_0}(x, s)] f″(s) ds
  [Figure: for convex f, E_P[f(x)] ≥ f(E_P[x]); the Jensen gap is the difference]
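
Aside (not part of the original slides): a tiny sketch of the Jensen gap and its affine invariance on a finite distribution, assuming NumPy; the support points, weights, and f(x) = x log x are made-up illustration values.

import numpy as np

f  = lambda x: x * np.log(x)          # a convex function
xs = np.array([0.5, 1.0, 2.0, 4.0])   # support of P (assumed)
p  = np.array([0.1, 0.4, 0.3, 0.2])   # probabilities (assumed)

def jensen_gap(g):
    # J_P[g(x)] = E_P[g(x)] - g(E_P[x]).
    return p @ g(xs) - g(p @ xs)

print(jensen_gap(f) >= 0)   # True: f is convex
# Affine invariance: equal up to floating-point error.
print(jensen_gap(f), jensen_gap(lambda x: f(x) + 3 * x - 7))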

  15. Representations of Convex Functions
  Integral Representation (via Taylor's Theorem)
  • f(t) = Λ_f(t) + ∫_a^b g(t, s) f″(s) ds
    where Λ_f(t) = f(t_0) + f′(t_0)(t − t_0) and g(t, s) = (t − s)_+ if s ≥ t_0, (s − t)_+ if s < t_0
  Variational Representation (via Fenchel Dual)
  • f(t) = sup_{t* ∈ R} { t·t* − f*(t*) }
    where f*(t*) = sup_{t ∈ R} { t·t* − f(t) }

  16. Binary Experiments and Measures of Divergence

  17. Binary Experiments
  • A binary experiment is a pair of distributions (P, Q) over the same space X
  • We will think of P as the positive and Q as the negative distribution
  • Given samples from X, how can we tell if they came from P or Q?
  ‣ Hypothesis testing
  • The "further apart" P and Q are, the easier this will be
  ‣ How do we define distance for distributions?
  [Figures: example P and Q on a discrete space {a, b, c}, and densities dP and dQ on a continuous space X]

  18. Test Statistics
  • We would like our distances not to depend on the topology of the underlying space X
  • A test statistic τ maps each point in X to a point on the real line
  ‣ Usually a function of the distribution
  • A statistical test can be obtained by thresholding a test statistic: r(x) = ⟦τ(x) ≥ τ_0⟧
  • Each threshold partitions the space into positive and negative parts
  [Figure: a statistic τ mapping X to R; thresholding at τ_0 splits X into + and − regions]

  19. Statistical Power and Size
  • True Positive Rate: P(τ ≥ τ_0)
  • False Positive Rate: Q(τ ≥ τ_0)
  • True Negative Rate: Q(τ < τ_0)
  • False Negative Rate: P(τ < τ_0)
  Power
  • 1 − β = True Positive Rate = P(τ ≥ τ_0)
  Size
  • α = False Positive Rate = Q(τ ≥ τ_0)
  Contingency table:
                   Actual +                Actual –
  Predicted +      True Positives (TP)     False Positives (FP)
  Predicted –      False Negatives (FN)    True Negatives (TN)
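
Aside (not part of the original slides): a minimal sketch of power and size for a thresholded statistic on a finite space, assuming NumPy; the distributions and statistic values below are made-up examples.

import numpy as np

P   = np.array([0.7, 0.2, 0.1])   # positive distribution over X = {a, b, c} (assumed)
Q   = np.array([0.2, 0.3, 0.5])   # negative distribution (assumed)
tau = np.array([2.0, 1.0, 0.5])   # test statistic tau(x) for x = a, b, c (assumed)

def rates(tau0):
    accept = tau >= tau0           # predict positive where tau(x) >= tau0
    tp = P[accept].sum()           # power = 1 - beta = P(tau >= tau0)
    fp = Q[accept].sum()           # size  = alpha   = Q(tau >= tau0)
    return tp, fp

for tau0 in (0.0, 0.75, 1.5, 3.0):
    print(tau0, rates(tau0))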

  20. The Neyman-Pearson Lemma
  Likelihood ratio
  • τ*(x) = dP/dQ (x)
  Neyman-Pearson Lemma (1933)
  • The likelihood ratio is the uniformly most powerful (UMP) statistical test
  ‣ It always has the largest TP rate for any given FP rate
  [Figure: ROC curves (TP rate vs FP rate); the curve for τ* dominates that of any other statistic τ]
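
Aside (not part of the original slides): a sketch comparing ROC points of the likelihood-ratio statistic against an arbitrary alternative statistic on a small finite space, assuming NumPy; all distributions and statistic values are made-up.

import numpy as np

P = np.array([0.50, 0.30, 0.15, 0.05])   # positive distribution (assumed)
Q = np.array([0.05, 0.15, 0.30, 0.50])   # negative distribution (assumed)

def roc(tau):
    # ROC points (FP, TP) obtained by sweeping thresholds over the values of tau.
    points = [(0.0, 0.0)]
    for t0 in sorted(set(tau), reverse=True):
        accept = tau >= t0
        points.append((Q[accept].sum(), P[accept].sum()))
    return points

likelihood_ratio = P / Q                  # tau*(x) = dP/dQ(x)
other = np.array([1.0, 3.0, 2.0, 0.5])    # some other ordering of X (assumed)

print(roc(likelihood_ratio))   # dominates: largest TP rate at each FP rate
print(roc(other))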

  21. Csiszár f-Divergence
  • The f-divergence of P from Q is the Q-average of the likelihood ratio transformed by the function f:
    I_f(P, Q) = E_Q[ f(dP/dQ) ] = ∫_X f(dP/dQ) dQ
  ‣ f can be seen as a penalty for dP(x) ≠ dQ(x)
  • To be a divergence, we want:
  ‣ I_f(P, Q) ≥ 0 for all P, Q
  ‣ I_f(Q, Q) = f(1) = 0 for all Q
  • Since E_Q[dP/dQ] = 1, Jensen's inequality gives I_f(P, Q) = J_Q[ f(dP/dQ) ] + f(1) ≥ 0 (a "Jensen gap") provided
  ‣ f is convex
  ‣ f(1) = 0
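
Aside (not part of the original slides): a minimal sketch of the generic f-divergence for finite distributions, assuming NumPy; P and Q are made-up, and the generators are the standard examples listed on the next slide.

import numpy as np

def f_divergence(f, P, Q):
    # I_f(P, Q) = E_Q[ f(dP/dQ) ] for finite distributions with Q > 0 everywhere.
    ratio = P / Q
    return np.sum(Q * f(ratio))

P = np.array([0.7, 0.2, 0.1])   # assumed example
Q = np.array([0.2, 0.3, 0.5])   # assumed example

fs = {
    "variational": lambda t: np.abs(t - 1),
    "KL":          lambda t: t * np.log(t),
    "Hellinger":   lambda t: (np.sqrt(t) - 1) ** 2,
    "Pearson":     lambda t: (t - 1) ** 2,
    "triangular":  lambda t: (t - 1) ** 2 / (t + 1),
}
for name, f in fs.items():
    print(name, f_divergence(f, P, Q))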

  22. Properties and Examples
  Symmetry
  • I_f(P, Q) = I_{f⋄}(Q, P), where f⋄(t) := t f(1/t)
  • I_f(P, Q) = I_f(Q, P) for all P, Q ⟺ f(t) = f⋄(t) + c(t − 1) for some c
  Closure
  • I_{af + bg} = a I_f + b I_g
  Affine Invariance
  • I_f = I_g ⟺ f(t) = g(t) + c(t − 1) for some c
  Examples
  • Variational: f(t) = |t − 1|
  • KL-divergence: f(t) = t ln t
  • Hellinger: f(t) = (√t − 1)²
  • Pearson χ²: f(t) = (t − 1)²
  • Triangular: f(t) = (t − 1)² / (t + 1)
  [Figure: plots of each example f(t)]
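
Aside (not part of the original slides): a quick sketch of the symmetry property I_f(P, Q) = I_{f⋄}(Q, P) for the KL generator, assuming NumPy; P and Q are made-up examples.

import numpy as np

def f_divergence(f, P, Q):
    # I_f(P, Q) = E_Q[ f(dP/dQ) ] for finite distributions.
    return np.sum(Q * f(P / Q))

P = np.array([0.7, 0.2, 0.1])   # assumed example
Q = np.array([0.2, 0.3, 0.5])   # assumed example

f      = lambda t: t * np.log(t)     # KL-divergence generator
f_dual = lambda t: t * f(1.0 / t)    # Csiszar dual f⋄(t) = t f(1/t); here -log(t)

print(f_divergence(f, P, Q), f_divergence(f_dual, Q, P))   # equal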

  23. Bregman Divergence (Generative)
  • Measures the average divergence between the densities of P and Q:
    B_f(P, Q) := E_M[ B_f(dP, dQ) ] = E_M[ f(dP) − f(dQ) − (dP − dQ) f′(dQ) ]
  • An "additive" analogue of the f-divergence
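
Aside (not part of the original slides): a sketch of the generative Bregman divergence on a finite space, assuming NumPy. The reference measure M is not specified on the slide; taking it to be counting measure on the finite space is an assumption here, so the densities are just the probability vectors.

import numpy as np

f  = lambda t: t * np.log(t)
f1 = lambda t: np.log(t) + 1.0

def generative_bregman(P, Q):
    # B_f(P, Q) = sum_x [ f(p(x)) - f(q(x)) - (p(x) - q(x)) f'(q(x)) ]
    # with counting reference measure (assumed).
    return np.sum(f(P) - f(Q) - (P - Q) * f1(Q))

P = np.array([0.7, 0.2, 0.1])   # assumed example
Q = np.array([0.2, 0.3, 0.5])   # assumed example
print(generative_bregman(P, Q))   # for f(t) = t log t this equals KL(P, Q)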

  24. Bregman and f-Divergences
  • What is the relationship between the classes of (generative) Bregman divergences and f-divergences?
  ‣ One is "additive", the other "multiplicative"
  • They only have the KL divergence in common [Csiszár, 1995]:
    I_f(P, Q) = B_f(P, Q), i.e. E_M[I_f(p, q)] = E_M[B_f(p, q)], ⟺ f(t) = t log(t) − t + 1
  [Figure: Venn diagram of Bregman divergences and Csiszár f-divergences, intersecting only at the KL divergence]
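
Aside (not part of the original slides): a numerical check, assuming NumPy and a counting reference measure on a finite space, that f(t) = t log t − t + 1 yields the same value whether used as an f-divergence or as a generative Bregman divergence (both equal KL); P and Q are made-up examples.

import numpy as np

f  = lambda t: t * np.log(t) - t + 1.0
f1 = lambda t: np.log(t)

P = np.array([0.7, 0.2, 0.1])   # assumed example
Q = np.array([0.2, 0.3, 0.5])   # assumed example

i_f = np.sum(Q * f(P / Q))                     # f-divergence I_f(P, Q)
b_f = np.sum(f(P) - f(Q) - (P - Q) * f1(Q))    # Bregman divergence B_f(P, Q)
print(i_f, b_f)   # both equal KL(P, Q)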

  25. Classification and Probability Estimation
