MLSS.cc Machine Learning Summer School, Thursday, 29th January 2009



SLIDE 1

Information, Divergence and Risk for Binary Classification

Mark Reid* [mark.reid@anu.edu.au]

Research School of Information Science and Engineering The Australian National University, Canberra, ACT, Australia

Machine Learning Summer School

Thursday, 29th January 2009

*Joint work with Robert Williamson

MLSS.cc

SLIDE 2

Introduction

SLIDE 3

The Blind Men & The Elephant

AUC STATISTICAL INFORMATION COST CURVES

F-DIVERGENCE

BREGMAN DIVERGENCE

SLIDE 4

Overview

Convex function representations

  • Integral (Taylor’s theorem)
  • Variational (LF Dual)

Binary Experiments

  • Distinguishing between two probability distributions or classes

Classification Problems

  • Distinguishing between two distributions, for each instance

Measures of Divergence

  • Csiszár and Bregman divergences
  • Loss, Risk and Regret
  • Statistical Information

Representations

  • Loss and Divergence

Bounds and Applications

  • Reductions
  • Loss and Pinsker Bounds
SLIDE 5

What’s in it for me?

What to expect

  • Lots of definitions
  • Various points of view on the same concepts
  • Relationships between those concepts
  • An emphasis on problems over techniques

What not to expect

  • Algorithms
  • Models
  • Sample complexity analysis
  • Everything is idealised, i.e., assuming complete data
  • Technicalities
SLIDE 6

Part I: Convexity and Binary Experiments

SLIDE 7

Overview

Convex Functions

  • Definitions & Properties
  • Fenchel & Csiszár Duals
  • Taylor Expansion
  • The Jensen Gap

Binary Experiments and Divergence

  • Definitions & Examples
  • Statistics
  • Neyman-Pearson Lemma
  • Bregman & f-Divergence

Class Probability Estimation

  • Generative/Discriminative Views
  • Loss, Risk, Regret
  • Savage’s Theorem
  • Statistical Information
  • Bregman Information
SLIDE 8

Convex Functions and their Representations

SLIDE 9

Convex Sets

  • Given points x₁, . . . , xₙ ∈ ℝᵈ and weights λ₁, . . . , λₙ ≥ 0 such that Σᵢ λᵢ = 1, their convex combination is Σᵢ λᵢxᵢ
  • We say S ⊆ ℝᵈ is a convex set if it is closed under convex combination. That is, for any n, any x₁, . . . , xₙ ⊂ S and weights λ₁, . . . , λₙ ≥ 0 with Σᵢ λᵢ = 1, we have Σᵢ λᵢxᵢ ∈ S
  • It suffices to show for all x₁, x₂ ∈ S and λ ∈ [0, 1] that λx₁ + (1 − λ)x₂ ∈ S

[Figure: a convex set and a non-convex set]

SLIDE 10

Convex Functions

  • The epigraph of a function f is the set of points that lie above it: epi(f) := {(x, y) : x ∈ ℝᵈ, y ≥ f(x)}
  • A function is convex if its epigraph is a convex set
  • Lines interpolating any two points on its graph lie above it
  • A convex function is necessarily continuous
  • A point-wise sum of convex functions is convex

[Figure: the epigraph epi(f) of a convex function f]

SLIDE 11

The Legendre-Fenchel Transform

  • The LF Transform generalises the notion of a derivative to non-differentiable functions: f*(t*) = sup_{t ∈ ℝᵈ} {⟨t, t*⟩ − f(t)}
  • When f is differentiable at t: f*(t*) = t*·t − f((f′)⁻¹(t*))
  • The double LF transform f**(t) = sup_{t* ∈ ℝᵈ} {⟨t*, t⟩ − f*(t*)} is involutive for convex f. That is, f**(t) = f(t)

[Figure: f(t) with a supporting line of slope t*, and f*(t*) with a supporting line of slope t]
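As a quick illustration (not from the slides), the LF transform can be approximated numerically by maximising over a finite grid; the helper `lf_transform` and both grids are my own names and choices. For f(t) = t² the dual is f*(t*) = t*²/4, and the bidual recovers f:

```python
import numpy as np

def lf_transform(f, grid):
    """Legendre-Fenchel transform f*(t*) = sup_t {t*.t - f(t)},
    approximated by maximising over a finite grid of t values."""
    ft = f(grid)
    return lambda ts: float(np.max(ts * grid - ft))

# Example: f(t) = t^2 has LF dual f*(t*) = (t*)^2 / 4.
f = lambda t: t ** 2
t_grid = np.linspace(-10.0, 10.0, 20001)
f_star = lf_transform(f, t_grid)
print(f_star(3.0))                     # approximately 9/4

# The double transform recovers f at interior points when f is convex.
ts_grid = np.linspace(-10.0, 10.0, 2001)
f_star_vals = np.array([f_star(ts) for ts in ts_grid])
f_bidual = lambda t: float(np.max(ts_grid * t - f_star_vals))
print(f_bidual(2.0))                   # approximately f(2) = 4
```

The grid maximisation is crude but makes the involution f** = f concrete for a convex f.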

SLIDE 12

Taylor’s Theorem

Integral Form of Taylor Expansion

  • Let [t₀, t] be an interval on which f is twice differentiable. Then
    f(t) = f(t₀) + (t − t₀)f′(t₀) + ∫_{t₀}^{t} (t − s) f″(s) ds

Corollary

  • Let f be twice differentiable on [a, b]. Then, for all t in [a, b],
    f(t) = f(t₀) + (t − t₀)f′(t₀) + ∫_{a}^{b} g(t, s) f″(s) ds
    where g(t, s) = (t − s)₊ for s ≥ t₀ and g(t, s) = (s − t)₊ for s < t₀

  • Differentiability can be removed if f′ and f″ are interpreted distributionally

SLIDE 13

Bregman Divergence

Bf(t, t₀) := f(t) − f(t₀) − ⟨t − t₀, ∇f(t₀)⟩

  • A Bregman divergence is a general class of "distance" measures defined using convex functions
  • In the 1-d case, Bf(t, t₀) = ∫_{t₀}^{t} (t − s) f″(s) ds is the non-linear part of the Taylor expansion of f

[Figure: Bf(t, t₀) for f(t) = t log(t), as the gap between f(t) and the tangent to f at t₀]
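A small numerical sketch of this identity (my own code, choosing f(t) = t log t): the closed-form Bregman divergence matches the Taylor-tail integral ∫ (t − s) f″(s) ds, here computed by a midpoint rule with f″(s) = 1/s.

```python
import numpy as np

def bregman(f, fprime, t, t0):
    """Bregman divergence B_f(t, t0) = f(t) - f(t0) - (t - t0) f'(t0)."""
    return f(t) - f(t0) - (t - t0) * fprime(t0)

f = lambda t: t * np.log(t)          # so f''(s) = 1/s
fprime = lambda t: np.log(t) + 1.0

t, t0 = 2.0, 0.5
b = bregman(f, fprime, t, t0)        # closed form: t log(t/t0) - t + t0

# The same quantity as the non-linear part of the Taylor expansion:
# integral of (t - s) f''(s) ds from t0 to t (midpoint rule).
n = 100000
ds = (t - t0) / n
s = t0 + ds * (np.arange(n) + 0.5)
taylor_tail = float(np.sum((t - s) / s) * ds)

print(b, taylor_tail)                # the two agree
```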

SLIDE 14

Jensen’s Inequality

Jensen Gap

  • For convex f : ℝ → ℝ and distribution P define JP[f(x)] := EP[f(x)] − f(EP[x])

Jensen's Inequality

  • The Jensen Gap is non-negative for all P if and only if f is convex

Affine Invariance

  • For all values a, b: JP[f(x) + bx + a] = JP[f(x)]

Taylor Expansion

  • JP[f(x)] = JP[∫_{a}^{b} g_{x₀}(x, s) f″(s) ds] = ∫_{a}^{b} JP[g_{x₀}(x, s)] f″(s) ds

[Figure: EP[f(x)] versus f(EP[x]) for points x₁, . . . , x₄, with the Jensen gap JP[f(x)]]
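A minimal check of these facts on a discrete distribution (the helper `jensen_gap` is my own): the gap is non-negative for convex f and unchanged by affine terms.

```python
import numpy as np

def jensen_gap(f, xs, ps):
    """J_P[f(x)] = E_P[f(x)] - f(E_P[x]) for a discrete distribution."""
    xs, ps = np.asarray(xs, float), np.asarray(ps, float)
    return float(np.dot(ps, f(xs)) - f(np.dot(ps, xs)))

xs = [0.0, 1.0, 2.0, 4.0]
ps = [0.1, 0.4, 0.3, 0.2]

f = lambda x: x ** 2
gap = jensen_gap(f, xs, ps)           # non-negative since f is convex
# Affine invariance: J_P[f(x) + b x + a] = J_P[f(x)]
g = lambda x: f(x) + 3.0 * x - 7.0
print(gap, jensen_gap(g, xs, ps))     # identical values
```

For f(x) = x² the Jensen gap is just the variance of P.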

SLIDE 15

Representations of Convex Functions

Integral Representation

  • Via Taylor's Theorem: f(t) = Λf(t) + ∫_{a}^{b} g(t, s) f″(s) ds
    where Λf(t) = f(t₀) + f′(t₀)(t − t₀)
    and g(t, s) = (t − s)₊ for s ≥ t₀, (s − t)₊ for s < t₀

Variational Representation

  • Via Fenchel Dual: f(t) = sup_{t* ∈ ℝ} {t·t* − f*(t*)}
    where f*(t*) = sup_{t ∈ ℝ} {t·t* − f(t)}

SLIDE 16

Binary Experiments and Measures of Divergence

SLIDE 17

Binary Experiments

  • A binary experiment is a pair of

distributions (P,Q) over the same space

  • We will think of P as the positive and

Q as the negative distribution

  • Given samples from X, how can we tell if they came from P or Q?

  • Hypothesis Testing
  • The “further apart” P and Q are the

easier this will be

  • How do we define distance for

distributions?

[Figure: P and Q as probability tables on a discrete space {a, b, c} and as densities dP, dQ on a continuous space X]

SLIDE 18

Test Statistics

  • We would like our distances not to depend on the topology of the underlying space
  • A test statistic τ maps each point in X to a point on the real line
  • Usually a function of the distributions
  • A statistical test r(x) = ⟦τ(x) ≥ τ₀⟧ can be obtained by thresholding a test statistic at some τ₀ ∈ ℝ
  • Each threshold partitions the space into positive and negative parts

SLIDE 19

Statistical Power and Size

Contingency Table

  • Predicted vs. actual class: True Positives (TP), False Positives (FP), False Negatives (FN), True Negatives (TN)
  • True Positive Rate: P(τ ≥ τ₀)
  • False Positive Rate: Q(τ ≥ τ₀)
  • True Negative Rate: Q(τ < τ₀)
  • False Negative Rate: P(τ < τ₀)

Power

  • 1 − β = True Positive Rate = P(τ ≥ τ₀)

Size

  • α = False Positive Rate = Q(τ ≥ τ₀)

SLIDE 20

The Neyman-Pearson Lemma

Likelihood Ratio

  • τ*(x) = dP/dQ(x)

Neyman-Pearson Lemma (1933)

  • The likelihood ratio is the uniformly most powerful (UMP) statistical test
  • Always has the largest TP rate for any given FP rate

[Figure: ROC curves of a test τ and of the optimal test τ*]

SLIDE 21

Csiszár f-Divergence

  • The f-divergence of P from Q is the Q-average of the likelihood ratio transformed by the function f:
    If(P, Q) = EQ[f(dP/dQ)] = ∫_X f(dP/dQ) dQ
  • f can be seen as a penalty for dP(x) ≠ dQ(x)
  • To be a divergence, we want
  • If(P, Q) ≥ 0 for all P, Q
  • If(Q, Q) = 0 for all Q
  • Jensen's inequality requires
  • f convex
  • f(1) = 0
  • Then If(P, Q) = EQ[f(dP/dQ)] ≥ f(EQ[dP/dQ]) = f(1) = 0, so If(P, Q) = JQ[f(dP/dQ)] ≥ 0: the "Jensen Gap"

SLIDE 22

Properties and Examples

Symmetry

  • If(P, Q) = If⋄(Q, P), where f⋄(t) := t f(1/t)
  • If(P, Q) = If(Q, P) for all P, Q ⇔ f(t) = f⋄(t) + c(t − 1)

Closure

  • I_{af+bg} = a If + b Ig

Affine Invariance

  • If = Ig ⇔ f(t) = g(t) + c(t − 1)

Examples

  • Variational: f(t) = |t − 1|
  • KL-Divergence: f(t) = t ln t
  • Hellinger: f(t) = (√t − 1)²
  • Pearson χ²: f(t) = (t − 1)²
  • Triangular: f(t) = (t − 1)²/(t + 1)

[Plots of each f and the corresponding divergence]
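These examples can be evaluated directly for discrete distributions. A sketch (my own helper `f_divergence`, reusing the P and Q values from the earlier discrete-space figure); note If(Q, Q) = 0 as required:

```python
import numpy as np

def f_divergence(f, p, q):
    """I_f(P, Q) = E_Q[f(dP/dQ)] for discrete distributions p, q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.2, 0.5, 0.3])

divs = {
    "Variational": lambda t: np.abs(t - 1),
    "KL":          lambda t: t * np.log(t),
    "Hellinger":   lambda t: (np.sqrt(t) - 1) ** 2,
    "Pearson":     lambda t: (t - 1) ** 2,
    "Triangular":  lambda t: (t - 1) ** 2 / (t + 1),
}
for name, fn in divs.items():
    print(name, f_divergence(fn, p, q))
```

Each value is non-negative, as Jensen's inequality guarantees for convex f with f(1) = 0.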
SLIDE 23

Bregman Divergence (Generative)

Bregman Divergences

  • Measures the average divergence between the densities of P and Q
  • The "additive" analogue of f-divergence

Bf(P, Q) := EM[Bf(dP, dQ)] = EM[f(dP) − f(dQ) − (dP − dQ)f′(dQ)]

SLIDE 24

Bregman and f-Divergences

  • What is the relationship between the classes of (generative) Bregman divergences and f-divergences?
  • One is "additive", the other "multiplicative"
  • They only have KL divergence in common [Csiszár, 1995]:
    If(P, Q) = Bf(P, Q) ⇔ f(t) = t log(t) − t + 1

[Figure: Bregman divergences and Csiszár f-divergences intersect only at KL divergence]

SLIDE 25

Classification and Probability Estimation

SLIDE 26

From Hypothesis Testing to Classification

Hypothesis Testing

  • Instances are drawn from either P or Q exclusively
  • The aim is to correctly decide which
  • Assumed: Binary Experiment (P, Q)
  • Imposed: Measure of divergence

Classification / Prob. Estimation

  • Instances are drawn from a mixture of P and Q
  • The aim is to correctly decide which, for each instance
  • Assumed: Binary Mixture (π, P, Q)
  • Imposed: Misclassification penalty
SLIDE 27

Generative and Discriminative Views

Bayes' Rule

  • Discriminative view: (η, M); Generative view: (π, P, Q); both determine the joint distribution P_{X×Y}
  • dM = π dP + (1 − π) dQ
  • π = EM[η]
  • η = π dP/dM
  • dP = (η/π) dM and dQ = ((1 − η)/(1 − π)) dM

[Figure: the mixture dM = π dP + (1 − π) dQ and the posterior η]
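A small numerical sketch of these identities (my own code) on a discrete space: build M and η from (π, P, Q), then recover π = EM[η] and the densities dP and dQ.

```python
import numpy as np

# Discrete binary experiment (P, Q) with prior pi on the positive class.
p  = np.array([0.7, 0.2, 0.1])   # dP
q  = np.array([0.2, 0.5, 0.3])   # dQ
pi = 0.4

m   = pi * p + (1 - pi) * q      # dM = pi dP + (1 - pi) dQ
eta = pi * p / m                 # eta = pi dP / dM, the posterior P(Y=1|x)

print(eta)
print(np.dot(m, eta))            # E_M[eta] recovers the prior pi
print(eta / pi * m)              # recovers dP
```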

SLIDE 28

Loss, Risk and Regret

Loss

  • ℓ(y, η̂): penalty for guessing η̂ when true class is y
  • Classification: η̂ ∈ {0, 1}
  • Prob. Estimation: η̂ ∈ [0, 1]

Point-wise Risk

  • Expected point-wise loss: L(η, η̂) = E_{Y∼η}[ℓ(Y, η̂)] = (1 − η)ℓ(0, η̂) + ηℓ(1, η̂)

Risk

  • Average point-wise risk: L(η̂) = EM[L(η, η̂)] for η̂ : X → [0, 1]

Bayes Risk

  • L(η) = inf_{η̂ ∈ [0,1]} L(η, η̂) and L = inf_{η̂ ∈ [0,1]^X} L(η̂)

Regret

  • B(η, η̂) = L(η, η̂) − L(η) and B(η̂) = L(η̂) − L

SLIDE 29

Loss, Risk and Regret (Examples)

0-1 Misclassification Loss

  • ℓ(y, η̂) = ⟦y ≠ ⟦η̂ > 0.5⟧⟧

Square Loss

  • ℓ(y, η̂) = (y − η̂)²

Log Loss

  • ℓ(y, η̂) = −y log(η̂) − (1 − y) log(1 − η̂)

Hinge Loss

  • ℓ(y, η̂) = y(0.5 − η̂)₊ + (1 − y)(η̂ − 0.5)₊
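These losses and their point-wise risks are easy to evaluate; a sketch (my own helpers, natural log for log loss) showing that the square-loss Bayes risk is η(1 − η) and that the risks are minimised at η̂ = η:

```python
import numpy as np

def l01(y, eta_hat):   # 0-1 misclassification loss, thresholding at 1/2
    return float(y != int(eta_hat > 0.5))

def lsq(y, eta_hat):   # square loss
    return (y - eta_hat) ** 2

def llog(y, eta_hat):  # log loss
    return -y * np.log(eta_hat) - (1 - y) * np.log(1 - eta_hat)

def pointwise_risk(loss, eta, eta_hat):
    """L(eta, eta_hat) = (1 - eta) l(0, eta_hat) + eta l(1, eta_hat)."""
    return (1 - eta) * loss(0, eta_hat) + eta * loss(1, eta_hat)

eta = 0.3
# Bayes risks at eta_hat = eta:
print(pointwise_risk(lsq, eta, eta))    # eta (1 - eta) = 0.21
print(pointwise_risk(llog, eta, eta))   # binary entropy of 0.3
print(pointwise_risk(l01, eta, eta))    # min(eta, 1 - eta) = 0.3
```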

SLIDE 30

Fisher Consistency & Proper Scoring Rules

Fisher Consistency

  • The point-wise risk for a loss ℓ is minimised by the true distribution: L(η, η) = inf_{η̂ ∈ [0,1]} L(η, η̂)
  • Strict consistency requires η to be the unique minimiser

Proper Scoring Rules

  • A loss ℓ is called a (strict) proper scoring rule if it is (strictly) Fisher consistent
  • As we shall see, these have a lot of structure that can be exploited
  • [Schervish, 1989]
  • [Buja et al., 2005]
  • [Lambert et al., 2008]

SLIDE 31

Properties of Proper Scoring Rules

Concave Bayes Risk

  • L(η) = inf_{η̂} {(1 − η)ℓ(0, η̂) + ηℓ(1, η̂)} is a lower envelope of lines
  • The weight function of −L is non-negative

Savage's Theorem

  • A scoring rule ℓ is proper iff its Bayes risk L is concave and L(η, η̂) = L(η̂) + (η − η̂)L′(η̂)
  • Relates Bayes risk and risk without optimisation

[Figure: concave Bayes risk L with tangent at η̂; the gap between L(η, η̂) and L(η) is the regret]

SLIDE 32

Statistical Information

  • Let U measure the "uncertainty" of a distribution ξ
  • When ξ is peaked its uncertainty is small
  • Assume prior π, and let ξ(x) be the posterior distribution after seeing x
  • The reduction in uncertainty is ∆U(π, ξ(x)) = U(π) − U(ξ(x))
  • The statistical information is the expected reduction in uncertainty for ξ when X ∼ M and π := EM[ξ(X)]:
    ∆U(ξ, M) = EM[U(π) − U(ξ(X))]
  • [De Groot, 1962]

[Figure: low-uncertainty (peaked) vs. high-uncertainty (flat) distributions; a prior and the posteriors for x₁, x₂, x₃]

SLIDE 33

Statistical Information

  • Observations can "at worst, contain no information ... typically [do] contain some information"
  • By Jensen's inequality, information is non-negative iff the uncertainty function U is concave:
    ∆U(ξ, M) = EM[U(π) − U(ξ(X))] = U(EM[ξ(X)]) − EM[U(ξ(X))] = JM[−U(ξ(X))] ≥ 0
  • By convention, U = 0 for deterministic distributions
  • A very general definition of information
  • e.g., Shannon information: U(p) = −Σᵢ pᵢ log pᵢ

SLIDE 34

Bregman Information

  • A recent, alternative formulation of information used to motivate clustering with Bregman divergences [Banerjee et al., 2005]
  • Given a random variable S ∼ σ over S, its Bregman information is the minimum expected divergence from a single point in its domain:
    Bf(S) := inf_{s ∈ S} E_{S∼σ}[Bf(S, s)] = E_{S∼σ}[Bf(S, Eσ[S])]
  • This single point is always the mean of S
  • Why is this information-like?
  • Average difference between a random point and the mean

SLIDE 35

Part II: Relationships and Representations

SLIDE 36

Overview (Include Map)

Relationships

  • Regret <-> Bregman Divergence
  • Bregman Info <-> Stat Info
  • f-divergence <-> Stat Info

Weighted Integral Representations

  • f-divergences
  • Scoring Rules
  • Translations

Graphical Representations

  • ROC and Risk Curves
  • Relationships
  • Weighted Integrals

Variational Representation

  • f-Divergence
  • MMD
  • Other Generalisations
  • Open questions
SLIDE 37

Relationships

SLIDE 38

Regret and Bregman Divergence

Binary Mixtures

  • Positive/negative class distributions (P, Q)
  • Mixture M = πP + (1 − π)Q
  • Conditional positive class probability η(x) = π dP/dM

Proper Scoring Rules

  • Fisher consistent: L(η) = L(η, η)
  • A loss is proper iff L is concave and L(η, η̂) = L(η̂) + (η − η̂)L′(η̂) (Savage's Theorem)

Bregman Divergence

  • For convex f: Bf(t, t₀) = f(t) − f(t₀) − (t − t₀)f′(t₀)

Bregman Divergence for Mixtures

  • Let f = −L be convex. Then each Proper Scoring Rule (PSR) regret is a Bregman divergence:
    Bf(η, η̂) = −L(η) + L(η̂) + (η − η̂)L′(η̂) = L(η, η̂) − L(η)
  • [Buja et al., 2005]
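A numerical sketch of this identity for log loss (my own code): the regret L(η, η̂) − L(η) coincides with the Bregman divergence of f = −L, which for log loss is the binary KL divergence.

```python
import numpy as np

def bayes_risk(eta):                   # L(eta) for log loss: binary entropy
    return -eta * np.log(eta) - (1 - eta) * np.log(1 - eta)

def pointwise_risk(eta, eta_hat):      # L(eta, eta_hat) for log loss
    return -eta * np.log(eta_hat) - (1 - eta) * np.log(1 - eta_hat)

def bregman(eta, eta_hat):
    """B_f(eta, eta_hat) with f = -L, i.e. negative binary entropy."""
    f  = lambda t: t * np.log(t) + (1 - t) * np.log(1 - t)
    fp = lambda t: np.log(t) - np.log(1 - t)
    return f(eta) - f(eta_hat) - (eta - eta_hat) * fp(eta_hat)

eta, eta_hat = 0.3, 0.6
regret = pointwise_risk(eta, eta_hat) - bayes_risk(eta)
print(regret, bregman(eta, eta_hat))   # the two agree (binary KL)
```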

SLIDE 39

Bregman and Statistical Information

Bregman Info = Statistical Info

  • For a binary mixture (π, P, Q) = (η, M): Bf(η(X)) = ∆U(η, M) when f = −U

Proof

  • Savage's Theorem implies L is concave for proper scoring rules
  • Choosing U = L gives a measure of information in the mixture (π, P, Q) = (η, M):
    ∆L(η, M) = EM[L(π) − L(η)] = L(π, M) − L(η, M)
  • The maximum reduction in risk is obtained by knowing the posterior
  • Each PSR defines an information measure for experiments:
    Bf(η(X)) = EM[Bf(η(X), EM[η(X)])]
             = EM[f(η(X)) − f(π) − (η(X) − π)f′(π)]
             = EM[f(η(X))] − f(π)
             = U(π) − EM[U(η(X))]
             = U(EM[η(X)]) − EM[U(η(X))] = ∆U(η, M)

SLIDE 40

Statistical Information and f-Divergence

Binary Mixtures & Experiments

  • (P, Q) vs. (π, P, Q) = (η, M)
  • For each π there is a mapping between dP/dQ and η:
    η = π dP/dM = π dP/(π dP + (1 − π) dQ) = λ/(λ + 1), where λ = (π/(1 − π)) dP/dQ
    dP/dQ = ((1 − π)/π) · η/(1 − η)

f-Divergence to Information

  • If fπ(t) = L(π) − (πt + 1 − π) L(πt/(πt + 1 − π)) then Ifπ(P, Q) = ∆L(η, M) for all binary mixtures (π, P, Q)

Information to f-Divergence

  • If Lπ(η) = −((1 − η)/(1 − π)) f(((1 − π)/π) · η/(1 − η)) then If(P, Q) = ∆Lπ(η, M) for all binary mixtures (π, P, Q)

  • f-divergence and statistical information are equivalent for binary mixtures [Österreicher & Vajda, 1993]

SLIDE 41

Examples

SLIDE 42

Weighted Integral Representations

SLIDE 43

Representations of Functions

Functions as "Sums" of Points

  • A function f can be described by its values at each point: f(x) = Σ_u f_u δ_u(x), where δ_u(x) := ⟦u = x⟧

Functions as Sums of Functions

  • Can also describe f as a sum of "simple" functions (e.g., Fourier analysis): f(x) = Σᵢ wᵢ φᵢ(x)

SLIDE 44

Integral Representation of f-Divergence

Taylor Integral Representation

  • f(t) = Λf(t) + ∫_{a}^{b} g_s(t) f″(s) ds, with linear term Λf and simple weights
    g_s(t) = ⟦s ≥ t₀⟧(t − s)₊ + ⟦s < t₀⟧(s − t)₊

f-Divergence

  • If(P, Q) = EQ[f(dP/dQ)]

Integral Representation I

  • If(P, Q) = EQ[∫₀^∞ g_s(dP/dQ) f″(s) ds] = ∫₀^∞ EQ[g_s(dP/dQ)] f″(s) ds
  • If(P, Q) = ∫₀^∞ Igs(P, Q) f″(s) ds

Integral Representation II

  • Substituting s = (1 − π)/π gives
    If(P, Q) = ∫₀¹ I_{g_{(1−π)/π}}(P, Q) f″((1 − π)/π) π⁻² dπ = ∫₀¹ Ifπ(P, Q) γ(π) dπ
    where fπ(t) = min(1 − π, π) − min(1 − π, πt) and γ(π) = (1/π³) f″((1 − π)/π)
  • [Liese & Vajda, 2006]

SLIDE 45

Integral Representation of Proper Scoring Rules

Conditional Bayes Risk

  • Given concave L the loss is L(η, η̂) = L(η̂) + (η − η̂)L′(η̂)

Integral Representation of Bayes Risk

  • By Taylor's Theorem, with weight function w(c) = −L″(c):
    L(η) = L(η̂) + (η − η̂)L′(η̂) − ∫₀¹ g_c(η, η̂) w(c) dc = L(η, η̂) − ∫₀¹ L_c(η, η̂) w(c) dc

Integral Representation of Risk

  • L(η, η̂) = L(η) + ∫₀¹ L_c(η, η̂) w(c) dc
    where L_c(η, η̂) = ⟦η > c ≥ η̂⟧(η − c) + ⟦η̂ > c ≥ η⟧(c − η)

Integral Representation of Loss

  • Assuming L(0) = L(1) = 0: ℓ(y, η̂) = ∫₀¹ ℓ_c(y, η̂) w(c) dc, with ℓ(y, η̂) = L(y, η̂) for y ∈ {0, 1}

Cost-Weighted Loss

  • ℓ_c(y, η̂) = (1 − c)⟦y = 1⟧⟦c ≥ η̂⟧ + c⟦y = 0⟧⟦η̂ > c⟧
    (cost of a false negative is 1 − c; cost of a false positive is c)
  • [Shuford et al., 1966] [Schervish, 1989] [Buja et al., 2005] [Lambert et al., 2008]

SLIDE 46

Cost-Weighted Misclassification Loss

ℓ_c(y, η̂) = (1 − c)⟦y = 1⟧⟦c ≥ η̂⟧ + c⟦y = 0⟧⟦η̂ > c⟧

[Plots of ℓ_c for c = 0.25, c = 0.5 and c = 0.75]

SLIDE 47

Example - Square Loss

ℓ(y, η̂) = (y − η̂)², with ℓ(y, η̂) = ∫₀¹ ℓ_c(y, η̂) w(c) dc and weight function w(c) = 1

[Plots: square loss as a uniformly weighted combination of cost-weighted losses]

SLIDE 48

Example - Asymmetric Log Loss

ℓ(y, η̂) = ∫₀¹ ℓ_c(y, η̂) w(c) dc, with weight function w(c) = 1/(c²(1 − c))

[Plots: the asymmetric log loss and its weight function]

SLIDE 49

Integral Representation of Statistical Information

Integral Representations

  • ℓ(y, η̂) = ∫₀¹ ℓ_c(y, η̂) w(c) dc
  • L(η, η̂) = E_{y∼η}[∫₀¹ ℓ_c(y, η̂) w(c) dc] = ∫₀¹ E_{y∼η}[ℓ_c(y, η̂)] w(c) dc = ∫₀¹ L_c(η, η̂) w(c) dc
  • L(η̂) = ∫₀¹ L_c(η̂) w(c) dc

Statistical Information

  • ∆L(η, M) = L(π, M) − L(η, M)
  • L(η, M) = inf_{η̂:[0,1]^X} ∫₀¹ L_c(η̂) w(c) dc = ∫₀¹ inf_{η̂ ∈ [0,1]} L_c(η̂) w(c) dc = ∫₀¹ L_c(η, M) w(c) dc
  • ∆L(η, M) = ∫₀¹ ∆L_c(η, M) w(c) dc

Primitive Bayes Risk

  • L_c(η) = min((1 − η)c, (1 − c)η)

SLIDE 50

Translating Weights

  • The earlier connection between f-divergence and statistical information suggests that their weight functions are related
  • Some straightforward algebra gives an explicit translation between the primitives If = ∫₀¹ Ifπ γ(π) dπ and ∆L = ∫₀¹ ∆L_c w(c) dc:
    wπ(c) = (π(1 − π)/ν(π, c)³) γ((1 − c)π/ν(π, c))
    γπ(c) = (π²(1 − π)²/ν(π, c)³) w((1 − c)π/ν(π, c))
    where ν(π, c) = (1 − c)π + (1 − π)c
  • Dependence on prior π
  • Cubic term due to the mapping from [0, ∞) to [0, 1]

SLIDE 51

Graphical Representations

SLIDE 52

ROC Curves

  • A threshold t is applied to a test statistic τ to create a statistical test τ ≥ t
  • Contingency table for each test
  • Plotting (TP, FP) = (P(τ ≥ t), Q(τ ≥ t)) as t varies gives an ROC curve for τ
  • The NP Lemma implies the optimal ROC curve is obtained when τ = dP/dQ

[Figure: ROC curves for a test τ and the optimal test τ*]

SLIDE 53

Area Under the ROC Curve (AUC)

  • A natural measure of quality for a test statistic is the area under its ROC curve
  • Ranking interpretation: the probability of ranking a random instance from P ahead of one from Q
  • Equivalent to the Mann-Whitney-Wilcoxon statistic
  • Is maximal AUC an f-divergence?
  • No...
  • ...but it is V(P×Q, Q×P)

[Figure: AUC as the shaded area under an ROC curve]
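The Mann-Whitney-Wilcoxon form makes AUC a two-line computation; a sketch with hypothetical scores (my own helper, ties counted as half):

```python
import numpy as np

def auc(pos_scores, neg_scores):
    """AUC as the Mann-Whitney-Wilcoxon statistic: the probability that a
    random positive is scored above a random negative (ties count 1/2)."""
    pos = np.asarray(pos_scores, float)[:, None]
    neg = np.asarray(neg_scores, float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

pos = [0.9, 0.8, 0.4]   # scores of instances drawn from P
neg = [0.7, 0.3, 0.2]   # scores of instances drawn from Q
print(auc(pos, neg))    # 8 of the 9 pairs are ranked correctly
```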

SLIDE 54

Risk Curves

  • A plot of cost-sensitive risk L_c for each value of the cost parameter c
  • The shape of the curve depends on the mixing probability π
  • The weighted area between the bottom curve L_c(η) and the "tent" is statistical information ∆L_c(η, M)
  • Divergence bounds
  • The weighted area between the two curves at the bottom is the regret B_c(η, η̂)
  • Surrogate loss bounds
  • [Drummond & Holte, 2006]

[Figure: risk curves L_c(η, η̂) and L_c(η) plotted against c, beneath the tent given by the risk of the prior]

SLIDE 55

ROC Curves to Risk Curves and Back

(FP, TP) → L_c = (1 − π)c FP + π(1 − c)(1 − TP)

(c, L_c) → TP = ((1 − π)c/((1 − c)π)) FP + (π(1 − c) − L_c)/((1 − c)π)

[Figure: a risk curve and the corresponding ROC curve]
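A sketch of this change of coordinates (function names are mine), checking that the two maps invert each other at a sample point:

```python
def roc_to_risk(fp, tp, c, pi):
    """Cost-weighted risk of the test with ROC point (FP, TP):
    L_c = (1 - pi) c FP + pi (1 - c) (1 - TP)."""
    return (1 - pi) * c * fp + pi * (1 - c) * (1 - tp)

def risk_to_roc_tp(fp, lc, c, pi):
    """Invert the map: recover TP from (c, L_c) at a given FP."""
    return ((1 - pi) * c * fp + pi * (1 - c) - lc) / (pi * (1 - c))

fp, tp, c, pi = 0.2, 0.8, 0.3, 0.4
lc = roc_to_risk(fp, tp, c, pi)
print(lc, risk_to_roc_tp(fp, lc, c, pi))   # recovers TP = 0.8
```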

SLIDE 56

ROC & Risk Curve Applet

SLIDE 57

Variational Representations

SLIDE 58

Variational Form of f-Divergence

  • Convex functions are invariant under the LF bidual:
    f(t) = f**(t) = sup_{t* ∈ ℝ} {t*·t − f*(t*)}
  • Substituting into the f-divergence definition:
    If(P, Q) = EQ[sup_{t* ∈ ℝ} {t*·(dP/dQ) − f*(t*)}]
             = ∫_X sup_{t* ∈ ℝ} {t* dP − f*(t*) dQ}
             = sup_{r:X→ℝ} ∫_X r dP − f*(r) dQ
             = sup_{r:X→ℝ} EP[r] − EQ[f*(r)]
  • The variational form does not use dP/dQ
  • Easier estimation [Nguyen et al, 2005]

SLIDE 59

Other Generalisations

Integral Probability Measures

  • Variational divergence with function class restrictions: V_R(P, Q) = sup_{r ∈ R} |EP[r] − EQ[r]|
  • What is the relationship to f-divergence?
  • If R = [−1, 1]^X and f(t) = |t − 1| then V_R = If
  • Any others?

(f, g)-divergences

  • Transform the predictor r by two LF duals: I_{f,g}(P, Q) = sup_r {−EP[g*(r)] − EQ[f*(r)]}
  • Does this give a larger class of divergences?

SLIDE 60

Part III: Bounds and Applications

SLIDE 61

Overview

Maximum Mean Discrepancy

  • Variational Form of f-Divergence

Bounds in terms of Primitives

  • Generalised Pinsker Bounds
  • Surrogate Loss Bounds
  • AUC Bound

Applications

  • Rederivation of the Probing

Reduction

  • Estimating f-Divergences using

Classification

SLIDE 62

Maximum Mean Discrepancy

SLIDE 63

Maximum Mean Discrepancy (MMD)

  • A special case of the variational form of f-divergence is when f(t) = |t − 1|
  • The restriction to [−1, 1] occurs due to the form of
    f*(t) = t for t ∈ [−1, 1], +∞ otherwise
  • V(P, Q) = sup_{r:X→[−1,1]} EP[r] − EQ[r]
  • Assume r is from the unit ball in an RKHS H for the kernel k with feature map φ, and define μ[P] := EP[φ(x)] = EP[k(x, ·)]. Then
    V(P, Q) = ‖μ[P] − μ[Q]‖_H
  • An easy test statistic to estimate, since
    ‖μ[P] − μ[Q]‖²_H = E_{P×P}[k(x, x′)] + E_{Q×Q}[k(y, y′)] − 2E_{P×Q}[k(x, y)]
    ≈ (1/m²) Σᵢⱼ k(xᵢ, xⱼ) + (1/n²) Σᵢⱼ k(yᵢ, yⱼ) − (2/mn) Σᵢ Σⱼ k(xᵢ, yⱼ)
  • [Gretton et al, 2007]
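A minimal sketch of the (biased) plug-in estimator above, assuming a Gaussian kernel; the function name, bandwidth and sample sizes are my own choices, not from the slides:

```python
import numpy as np

def mmd_squared(x, y, sigma=1.0):
    """Biased estimate of ||mu[P] - mu[Q]||_H^2 with a Gaussian kernel:
    mean k(xi, xj) + mean k(yi, yj) - 2 mean k(xi, yj)."""
    k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))
    return float(k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean())

rng = np.random.default_rng(0)
x    = rng.normal(0.0, 1.0, 500)   # samples from P
y    = rng.normal(0.5, 1.0, 500)   # samples from Q (shifted mean)
same = rng.normal(0.0, 1.0, 500)   # a second sample from P

print(mmd_squared(x, y))           # clearly positive: P differs from Q
print(mmd_squared(x, same))        # much smaller: same distribution
```

With i.i.d. samples the statistic concentrates around the population value, which is why it is a practical two-sample test.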

SLIDE 64

Generalised Pinsker Bounds

SLIDE 65

Pinsker’s Inequality

  • A lower bound on KL divergence in terms of variational divergence: KL(P, Q) ≥ 2V²(P, Q)
  • Information about the value of V constrains the possible values of KL

Better Pinsker Bounds

  • The above inequality is not tight
  • What we really want is L(V) = inf_{V(P,Q)=V} KL(P, Q)

[Plot: KL(P, Q) against V(P, Q), comparing 2V² with the tight bound L(V)]
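A quick numerical check of the inequality (my own script), assuming the normalisation V(P, Q) = sup_A |P(A) − Q(A)|, i.e. half the L1 distance, under which the constant 2 is the standard one (other parts of these slides use the unhalved L1 distance, which rescales the constant):

```python
import numpy as np

# Check KL(P, Q) >= 2 V(P, Q)^2 on random pairs of discrete distributions,
# with V(P, Q) = (1/2) sum |p - q|.
rng = np.random.default_rng(1)
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    kl = float(np.sum(p * np.log(p / q)))
    v = 0.5 * float(np.sum(np.abs(p - q)))
    assert kl >= 2.0 * v * v - 1e-12
print("Pinsker's inequality held on 1000 random pairs")
```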

SLIDE 66

Generalised Pinsker Inequalities

Primitive vs Composite

  • V is "primitive"
  • KL is "composite"

General Bound

  • Can we get tight bounds for any f-divergence given V?
  • Yes we can!
  • V gives "partial information" about the separation of P and Q

Variational Bounds by Divergence

  • Hellinger: h² ≥ 2 − √(4 − V²)
  • Jeffreys: J ≥ 2V ln((2 + V)/(2 − V))
  • Symmetric χ²: Ψ ≥ 8V²/(4 − V²)
  • AG Mean: T ≥ ln(4/√(4 − V²)) − ln 2
  • Pearson χ²: χ² ≥ V² for V < 1, and χ² ≥ V/(2 − V) for V ≥ 1

SLIDE 67

Generalised Pinsker Inequalities

Proof Sketch

  • An f-divergence is a weighted sum of primitive statistical informations
  • Each primitive is just an area on a risk diagram
  • The value at one point bounds the total area

Going Further

  • This proof is amenable to knowing multiple primitive values

SLIDE 68

Surrogate Loss Bounds

SLIDE 69

Surrogate Loss

Surrogate Loss

  • 0-1 loss is notoriously hard to optimise directly
  • One solution is to optimise a surrogate: an upper bound on 0-1 loss

SLIDE 70

Margin Loss and Proper Scoring Rules

SLIDE 71

Surrogate Loss Bounds

SLIDE 72

Applications

SLIDE 73

Reductions

  • A reduction is the transformation of one learning problem into another
  • Analogous to reductions in complexity theory (e.g., 3-SAT to Vertex Cover)
  • One aim is to get regret bounds for the target problem in terms of the source problem: R_t ≤ F(R_s)
  • Usually distribution free

[Diagram: a target problem transformed into a source problem, with the regret bound carried back]

SLIDE 74

The Probing Reduction

  • Probability estimation can be reduced to a family of cost-sensitive classification problems [Langford et al, 2005]
  • Square-loss regret is bounded by the average cost-sensitive regret
  • This can be re-derived immediately from the weighted integral representation, since square loss has w(c) = 1:
    B_sq(η̂, η) = ∫₀¹ B_c(η̂, η) dc, so EM[(η̂ − η)²] ≤ ∫₀¹ EM[L_c(η̂) − L_c(η)] dc
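A sketch verifying the integral identity behind this reduction at a single instance (helper names are mine). One caveat on normalisation: with ℓ(y, η̂) = (y − η̂)² the weight is w(c) = −L″(c) = 2, which matches the slides' w(c) = 1 up to how ℓ_c is scaled.

```python
import numpy as np

def cw_regret(eta, eta_hat, c):
    """Cost-weighted regret: (eta - c) when eta_hat <= c < eta,
    (c - eta) when eta <= c < eta_hat, and 0 otherwise."""
    lo, hi = min(eta, eta_hat), max(eta, eta_hat)
    return abs(eta - c) if lo <= c < hi else 0.0

eta, eta_hat = 0.3, 0.7
# Square-loss regret (eta - eta_hat)^2 as the weighted integral of the
# cost-weighted regrets, with w(c) = 2 (midpoint rule over c).
n = 100000
cs = (np.arange(n) + 0.5) / n
integral = float(np.mean([2.0 * cw_regret(eta, eta_hat, c) for c in cs]))
print(integral, (eta - eta_hat) ** 2)   # the two agree
```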

SLIDE 75

f-Divergence Estimation

SLIDE 76

Summary and Conclusions

SLIDE 77

Summary - The Problems

Hypothesis Testing

  • Given samples from P or Q, decide whether the samples were drawn from P or Q
  • Divergence / MMD

Classification

  • Given samples from a π-mixture of P and Q decide, for each instance x, whether x was drawn from P or Q
  • 0-1 Misclassification Loss

Probability Estimation

  • Given samples from a π-mixture of P and Q estimate, for each instance x, the probability x was drawn from P (or Q)
  • Proper Scoring Rules

Bipartite Ranking

  • Given samples from a π-mixture of P and Q, sort instances drawn from P ahead of those from Q
  • Area under ROC curve
SLIDE 78

Summary - The Representations

Weighted Integral Representation

  • Taylor's Theorem: f(t) = Λf(t) + ∫_{a}^{b} g_s(t) f″(s) ds
  • Proper Scoring Rules: ℓ(y, η̂) = ∫₀¹ ℓ_c(y, η̂) w(c) dc
  • f-Divergences: If(P, Q) = ∫₀¹ Ifπ(P, Q) γ(π) dπ

Variational Representation

  • Legendre-Fenchel Dual: f(t) = f**(t) = sup_{t* ∈ ℝ} {t*·t − f*(t*)}
  • f-Divergence: If(P, Q) = sup_{r:X→ℝ} EP[r] − EQ[f*(r)]

SLIDE 79

Summary - The Relationships

Information

  • Bregman Info = Stat Info

Divergence

  • Generative Bregman divergence and f-divergence have only KL divergence in common

Risk

  • Common surrogates are proper scoring rules (except hinge loss)
  • Classification via Probability Estimation

Risk and Information

  • Info = maximum reduction in risk

Information & Divergence

  • Statistical Info = f-divergence (given mixing prior π)
  • Explicit mapping of weights

Divergence and AUC

  • Maximal AUC is not an f-divergence
  • Max AUC = V(P×Q, Q×P)
SLIDE 80

Lessons

Importance of Convexity in Expectations

  • Any function expressible as a Jensen gap depends solely on weights derived from the 2nd derivative

Emphasise the Use of Weights

  • Like a Fourier transformation
  • Ignore affine variations
  • Connections made clearer
SLIDE 81

Where to from here?

Extensions to Other Problems

  • Multi-category classification and

probability estimation

  • Regression
  • Ranking