 
Hypothesis testing, information divergence and computational geometry

Frank Nielsen
Frank.Nielsen@acm.org
www.informationgeometry.org
Sony Computer Science Laboratories, Inc.

August 2013, GSI, Paris, FR

© 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.

The Multiple Hypothesis Testing (MHT) problem

Given a random variable $X$ with $n$ hypotheses
$$H_1: X \sim P_1, \quad \ldots, \quad H_n: X \sim P_n,$$
decide from an IID sample $x_1, \ldots, x_m \sim X$ which hypothesis holds true.

$$P^m_{\mathrm{correct}} = 1 - P^m_{\mathrm{error}}$$

Asymptotic regime (error exponent):
$$\alpha = -\frac{1}{m} \log P_e^m, \quad m \to \infty$$
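
To make the asymptotic regime concrete, here is a minimal sketch that computes $P_e^m$ exactly for two Bernoulli hypotheses and reports the finite-sample exponent $-\frac{1}{m}\log P_e^m$. It assumes the Bayes-optimal decision, whose error is the sum over outcomes of the smaller of the two weighted likelihoods. The function names and the Bernoulli rates are illustrative and do not come from the slides.

```python
import math

def bayes_error_bernoulli(p1, p2, m, w1=0.5, w2=0.5):
    """Exact Bayes error for m IID Bernoulli draws, H1: rate p1 vs H2: rate p2.

    The number of successes k is a sufficient statistic, so the error is
    the sum over k of min(w1 * Binom(k; m, p1), w2 * Binom(k; m, p2)).
    """
    total = 0.0
    for k in range(m + 1):
        c = math.comb(m, k)
        l1 = w1 * c * p1**k * (1 - p1)**(m - k)
        l2 = w2 * c * p2**k * (1 - p2)**(m - k)
        total += min(l1, l2)
    return total

def error_exponent(p1, p2, m):
    """Finite-sample exponent -(1/m) log P_e^m."""
    return -math.log(bayes_error_bernoulli(p1, p2, m)) / m

if __name__ == "__main__":
    # Illustrative rates (assumption): 0.3 under H1, 0.6 under H2.
    for m in (10, 50, 200):
        print(m, bayes_error_bernoulli(0.3, 0.6, m), error_exponent(0.3, 0.6, m))
```

The error can only shrink as $m$ grows, and the printed exponent settles toward a positive constant.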

Bayesian hypothesis testing (preliminaries)

Prior probabilities: $w_i = \Pr(X \sim P_i) > 0$ (with $\sum_{i=1}^n w_i = 1$).
Conditional probabilities: $\Pr(X = x \mid X \sim P_i)$.

$$\Pr(X = x) = \sum_{i=1}^n \Pr(X \sim P_i) \Pr(X = x \mid X \sim P_i) = \sum_{i=1}^n w_i \Pr(X \mid P_i)$$

Let $c_{i,j}$ = cost of deciding $H_i$ when in fact $H_j$ is true. The matrix $[c_{i,j}]$ is the cost design matrix.
Let $p_{i,j}(u)$ = probability of making this decision using rule $u$.

Bayesian detector

Minimize the expected cost:
$$E_X[c(r(x))], \quad c(r(x)) = \sum_i \Big( w_i \sum_{j \neq i} c_{i,j}\, p_{i,j}(r(x)) \Big)$$

Special case: the probability of error $P_e$ is obtained for $c_{i,i} = 0$ and $c_{i,j} = 1$ for $i \neq j$:
$$P_e = E_X\Big[ \sum_i \Big( w_i \sum_{j \neq i} p_{i,j}(r(x)) \Big) \Big]$$

The maximum a posteriori probability (MAP) rule classifies $x$ as
$$\mathrm{MAP}(x) = \operatorname{argmax}_{i \in \{1, \ldots, n\}} w_i\, p_i(x)$$
where $p_i(x) = \Pr(X = x \mid X \sim P_i)$ are the conditional probabilities.

→ The MAP Bayesian detector minimizes $P_e$ over all rules [8].
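
The MAP rule can be sketched for finite sample spaces as follows. This is an illustration under stated assumptions: each hypothesis is given as a dictionary from outcome to probability, and the names `map_decide` and `map_error` as well as the toy numbers are hypothetical.

```python
def map_decide(x, priors, pmfs):
    """Return the index i maximizing w_i * p_i(x)."""
    scores = [w * pmf.get(x, 0.0) for w, pmf in zip(priors, pmfs)]
    return max(range(len(scores)), key=lambda i: scores[i])

def map_error(priors, pmfs):
    """Probability of error of the MAP rule: 1 - sum_x max_i w_i p_i(x)."""
    outcomes = set().union(*pmfs)  # all outcomes with mass under some hypothesis
    correct = sum(
        max(w * pmf.get(x, 0.0) for w, pmf in zip(priors, pmfs))
        for x in outcomes
    )
    return 1.0 - correct

# Toy example (assumed numbers): three hypotheses over outcomes a, b, c.
priors = [0.5, 0.3, 0.2]
pmfs = [
    {"a": 0.7, "b": 0.2, "c": 0.1},
    {"a": 0.1, "b": 0.8, "c": 0.1},
    {"a": 0.2, "b": 0.2, "c": 0.6},
]

if __name__ == "__main__":
    print([map_decide(x, priors, pmfs) for x in "abc"])
    print(map_error(priors, pmfs))
```

Each outcome is assigned to the hypothesis with the largest weighted likelihood, so the mass that is classified correctly is the sum of those maxima.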

Probability of error and divergences

Without loss of generality, consider equal priors ($w_1 = w_2 = \frac{1}{2}$):
$$P_e = \int_{x \in \mathcal{X}} p(x) \min(\Pr(H_1 \mid x), \Pr(H_2 \mid x))\, \mathrm{d}\nu(x)$$
($P_e > 0$ as soon as $\operatorname{supp} p_1 \cap \operatorname{supp} p_2 \neq \emptyset$.)

From Bayes' rule,
$$\Pr(H_i \mid X = x) = \frac{\Pr(H_i) \Pr(X = x \mid H_i)}{\Pr(X = x)} = \frac{w_i\, p_i(x)}{p(x)},$$
so that
$$P_e = \frac{1}{2} \int_{x \in \mathcal{X}} \min(p_1(x), p_2(x))\, \mathrm{d}\nu(x)$$

Rewrite or bound $P_e$ using tricks of the trade:
Trick 1. $\forall a, b \in \mathbb{R}$, $\min(a, b) = \frac{a + b}{2} - \frac{|a - b|}{2}$,
Trick 2. $\forall a, b > 0$, $\min(a, b) \le \min_{\alpha \in (0,1)} a^{\alpha} b^{1-\alpha}$.
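
Trick 2 rests on a short argument that the slide leaves implicit. For $a, b > 0$ and any fixed $\alpha \in (0, 1)$, split on which of the two values is smaller:

```latex
\begin{align*}
a \le b:\quad & \min(a,b) = a = a^{\alpha} a^{1-\alpha} \le a^{\alpha} b^{1-\alpha},\\
b \le a:\quad & \min(a,b) = b = b^{\alpha} b^{1-\alpha} \le a^{\alpha} b^{1-\alpha}.
\end{align*}
```

Since the bound holds for every $\alpha \in (0, 1)$, it also holds for the smallest value of $a^{\alpha} b^{1-\alpha}$ over that interval.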

Probability of error and total variation

$$P_e = \frac{1}{2} \int_{x \in \mathcal{X}} \left( \frac{p_1(x) + p_2(x)}{2} - \frac{|p_1(x) - p_2(x)|}{2} \right) \mathrm{d}\nu(x) = \frac{1}{2} \left( 1 - \frac{1}{2} \int_{x \in \mathcal{X}} |p_1(x) - p_2(x)|\, \mathrm{d}\nu(x) \right)$$

$$P_e = \frac{1}{2} \left( 1 - \mathrm{TV}(P_1, P_2) \right)$$

Total variation metric distance:
$$\mathrm{TV}(P, Q) = \frac{1}{2} \int_{x \in \mathcal{X}} |p(x) - q(x)|\, \mathrm{d}\nu(x)$$

→ Difficult to compute when handling multivariate distributions.
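
The identity $P_e = \frac{1}{2}(1 - \mathrm{TV})$ is easy to check numerically in the discrete case, where the integrals become sums. The sketch below assumes two probability vectors over the same finite alphabet; the names and the example vectors are illustrative.

```python
def total_variation(p, q):
    """TV(P, Q) = (1/2) * sum_x |p(x) - q(x)| for discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def bayes_error_equal_priors(p, q):
    """P_e = (1/2) * sum_x min(p(x), q(x)) for equal priors."""
    return 0.5 * sum(min(pi, qi) for pi, qi in zip(p, q))

# Assumed example distributions over a three-letter alphabet.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

if __name__ == "__main__":
    print(total_variation(p, q))
    print(bayes_error_equal_priors(p, q), 0.5 * (1.0 - total_variation(p, q)))
```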

Bounding the probability of error $P_e$

Since $\min(a, b) \le \min_{\alpha \in (0,1)} a^{\alpha} b^{1-\alpha}$ for $a, b > 0$, upper bound $P_e$:
$$P_e = \frac{1}{2} \int_{x \in \mathcal{X}} \min(p_1(x), p_2(x))\, \mathrm{d}\nu(x) \le \frac{1}{2} \min_{\alpha \in (0,1)} \int_{x \in \mathcal{X}} p_1^{\alpha}(x)\, p_2^{1-\alpha}(x)\, \mathrm{d}\nu(x).$$

Chernoff information:
$$C(P_1, P_2) = -\log \min_{\alpha \in (0,1)} \int_{x \in \mathcal{X}} p_1^{\alpha}(x)\, p_2^{1-\alpha}(x)\, \mathrm{d}\nu(x) \ge 0,$$

Best error exponent $\alpha^*$ [7]:
$$P_e \le w_1^{\alpha^*} w_2^{1-\alpha^*} e^{-C(P_1, P_2)} \le e^{-C(P_1, P_2)}$$

The bounding technique can be extended using any quasi-arithmetic $\alpha$-means [13, 9]...
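
For discrete distributions the Chernoff information can be approximated by a one-dimensional search over $\alpha$. The sketch below uses ternary search, relying on the fact that the Chernoff coefficient is convex in $\alpha$. It assumes strictly positive probability vectors; the function names, the iteration count, and the example vectors are illustrative choices.

```python
import math

def chernoff_coefficient(p, q, alpha):
    """sum_x p(x)^alpha * q(x)^(1 - alpha)."""
    return sum(pi**alpha * qi**(1.0 - alpha) for pi, qi in zip(p, q))

def chernoff_information(p, q, iters=200):
    """Return (C, alpha*) by ternary search on the convex map alpha -> coefficient."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if chernoff_coefficient(p, q, a) < chernoff_coefficient(p, q, b):
            hi = b  # the minimizer cannot lie to the right of b
        else:
            lo = a  # the minimizer cannot lie to the left of a
    alpha = 0.5 * (lo + hi)
    return -math.log(chernoff_coefficient(p, q, alpha)), alpha

if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]
    q = [0.2, 0.3, 0.5]
    C, alpha = chernoff_information(p, q)
    p_e = 0.5 * sum(min(pi, qi) for pi, qi in zip(p, q))
    # With equal priors, w1^a * w2^(1-a) = 1/2, so P_e <= (1/2) exp(-C).
    print(C, alpha, p_e, 0.5 * math.exp(-C))
```

In this example `q` is `p` reversed, so the optimum sits at $\alpha^* = \frac{1}{2}$ by symmetry.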

Computational information geometry

Exponential family manifold [4]:
$$\mathcal{M} = \{ p_\theta \mid p_\theta(x) = \exp( t(x)^\top \theta - F(\theta) ) \}$$

Dually flat manifolds [1] enjoy dual affine connections [1]: $(\mathcal{M}, \nabla^2 F(\theta), \nabla^{(e)}, \nabla^{(m)})$.

$$\eta = \nabla F(\theta), \quad \theta = \nabla F^*(\eta)$$

Canonical divergence from the Young inequality:
$$A(\theta_1, \eta_2) = F(\theta_1) + F^*(\eta_2) - \theta_1^\top \eta_2 \ge 0$$
with equality when both coordinates describe the same point ($\eta = \nabla F(\theta)$):
$$F(\theta) + F^*(\eta) = \theta^\top \eta$$
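
As a concrete instance of this duality, consider the Poisson family, an example chosen here and not taken from the slides. With $t(x) = x$ and $\theta = \log \lambda$, the log-normalizer is $F(\theta) = e^{\theta}$, the dual coordinate is $\eta = \lambda$, and the Legendre conjugate is $F^*(\eta) = \eta \log \eta - \eta$. The sketch checks that the two gradients are mutually inverse and that the canonical divergence vanishes exactly on matching coordinates.

```python
import math

# Poisson family (assumed example): theta = log(rate), eta = rate.
def F(theta):
    return math.exp(theta)             # log-normalizer

def grad_F(theta):
    return math.exp(theta)             # eta = grad F(theta)

def F_star(eta):
    return eta * math.log(eta) - eta   # Legendre conjugate, defined for eta > 0

def grad_F_star(eta):
    return math.log(eta)               # theta = grad F*(eta)

def canonical_divergence(theta1, eta2):
    """A(theta1, eta2) = F(theta1) + F*(eta2) - theta1 * eta2 >= 0."""
    return F(theta1) + F_star(eta2) - theta1 * eta2

if __name__ == "__main__":
    theta = 0.7
    eta = grad_F(theta)
    print(grad_F_star(eta))                  # recovers theta
    print(canonical_divergence(theta, eta))  # zero up to rounding
    print(canonical_divergence(theta, 5.0))  # strictly positive
```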

MAP decision rule and additive Bregman Voronoi diagrams

$$\mathrm{KL}(p_{\theta_1} : p_{\theta_2}) = B(\theta_2 : \theta_1) = A(\theta_2 : \eta_1) = A^*(\eta_1 : \theta_2) = B^*(\eta_1 : \eta_2)$$

Canonical divergence (mixed primal/dual coordinates):
$$A(\theta_2 : \eta_1) = F(\theta_2) + F^*(\eta_1) - \theta_2^\top \eta_1 \ge 0$$

Bregman divergence (uni-coordinates, primal or dual):
$$B(\theta_2 : \theta_1) = F(\theta_2) - F(\theta_1) - (\theta_2 - \theta_1)^\top \nabla F(\theta_1)$$

$$\log p_i(x) = -B^*(t(x) : \eta_i) + F^*(t(x)) + k(x), \quad \eta_i = \nabla F(\theta_i) = \eta(P_{\theta_i})$$

Optimal MAP decision rule:
$$\begin{aligned}
\mathrm{MAP}(x) &= \operatorname{argmax}_{i \in \{1, \ldots, n\}} w_i\, p_i(x) \\
&= \operatorname{argmax}_{i \in \{1, \ldots, n\}} \left( -B^*(t(x) : \eta_i) + \log w_i \right) \\
&= \operatorname{argmin}_{i \in \{1, \ldots, n\}} \left( B^*(t(x) : \eta_i) - \log w_i \right)
\end{aligned}$$

→ nearest neighbor classifier [2, 10, 15, 16]
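
The equivalence between the MAP rule and an additively weighted Bregman nearest-neighbor query can be checked on a small example. The sketch below assumes Poisson hypotheses (an illustrative choice), for which $t(x) = x$, $\eta_i = \lambda_i$ and $B^*(\eta : \eta') = \eta \log(\eta / \eta') - \eta + \eta'$, with the convention $0 \log 0 = 0$. The function names, rates and weights are hypothetical.

```python
import math

def dual_bregman(eta_a, eta_b):
    """B*(eta_a : eta_b) for F*(eta) = eta log eta - eta, with 0 log 0 = 0."""
    log_term = eta_a * math.log(eta_a / eta_b) if eta_a > 0 else 0.0
    return log_term - eta_a + eta_b

def map_bregman(x, rates, weights):
    """argmin_i B*(t(x) : eta_i) - log w_i, by brute force over the hypotheses."""
    costs = [dual_bregman(float(x), lam) - math.log(w)
             for lam, w in zip(rates, weights)]
    return min(range(len(costs)), key=lambda i: costs[i])

def map_direct(x, rates, weights):
    """argmax_i log(w_i p_i(x)) using the Poisson log-likelihood directly."""
    scores = [math.log(w) + x * math.log(lam) - lam - math.lgamma(x + 1)
              for lam, w in zip(rates, weights)]
    return max(range(len(scores)), key=lambda i: scores[i])

# Assumed rates and priors for three Poisson hypotheses.
rates = [1.0, 4.0, 9.0]
weights = [0.5, 0.3, 0.2]

if __name__ == "__main__":
    for x in range(12):
        print(x, map_direct(x, rates, weights), map_bregman(x, rates, weights))
```

Both rules agree because the terms $F^*(t(x)) + k(x)$ do not depend on $i$ and drop out of the comparison.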

MAP & nearest neighbor classifier

Bregman Voronoi diagrams (with additive weights) are affine diagrams [2].

$$\operatorname{argmin}_{i \in \{1, \ldots, n\}} \left( B^*(t(x) : \eta_i) - \log w_i \right)$$

◮ point location in arrangement [3] (small dimensions),
◮ divergence-based search trees [16],
◮ GPU brute force [6].

Geometry of the best error exponent: binary hypothesis

On the exponential family manifold, Chernoff $\alpha$-coefficient [5]:
$$c_\alpha(P_{\theta_1} : P_{\theta_2}) = \int p_{\theta_1}^{\alpha}(x)\, p_{\theta_2}^{1-\alpha}(x)\, \mathrm{d}\mu(x) = \exp\left( -J_F^{(\alpha)}(\theta_1 : \theta_2) \right),$$

Skew Jensen divergence [14] on the natural parameters:
$$J_F^{(\alpha)}(\theta_1 : \theta_2) = \alpha F(\theta_1) + (1 - \alpha) F(\theta_2) - F(\theta_{12}^{(\alpha)}),$$
where $\theta_{12}^{(\alpha)} = \alpha \theta_1 + (1 - \alpha) \theta_2$.

Chernoff information = Bregman divergence for exponential families:
$$C(P_{\theta_1} : P_{\theta_2}) = B(\theta_1 : \theta_{12}^{(\alpha^*)}) = B(\theta_2 : \theta_{12}^{(\alpha^*)})$$
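
The closed form $c_\alpha = \exp(-J_F^{(\alpha)})$ can be compared against a direct summation. The sketch below assumes the Poisson family ($F(\theta) = e^{\theta}$, $\theta = \log \lambda$) and the interpolation $\theta_{12}^{(\alpha)} = \alpha\theta_1 + (1-\alpha)\theta_2$. The truncation point `x_max` and the function names are illustrative.

```python
import math

def skew_jensen_poisson(lam1, lam2, alpha):
    """J_F^(alpha)(theta1 : theta2) for F(theta) = exp(theta), theta = log(lam)."""
    theta_mix = alpha * math.log(lam1) + (1.0 - alpha) * math.log(lam2)
    return alpha * lam1 + (1.0 - alpha) * lam2 - math.exp(theta_mix)

def chernoff_coefficient_poisson(lam1, lam2, alpha, x_max=200):
    """sum_x p1(x)^alpha * p2(x)^(1-alpha), truncated at x_max (assumed large enough)."""
    total = 0.0
    for x in range(x_max + 1):
        log_p1 = x * math.log(lam1) - lam1 - math.lgamma(x + 1)
        log_p2 = x * math.log(lam2) - lam2 - math.lgamma(x + 1)
        total += math.exp(alpha * log_p1 + (1.0 - alpha) * log_p2)
    return total

if __name__ == "__main__":
    lam1, lam2, alpha = 2.0, 5.0, 0.3
    print(chernoff_coefficient_poisson(lam1, lam2, alpha))
    print(math.exp(-skew_jensen_poisson(lam1, lam2, alpha)))
```

The closed form avoids the sum entirely, which is what makes the exponential family setting computationally attractive.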

Geometry of the best error exponent: binary hypothesis

Chernoff distribution $P^*$ [12]:
$$P^* = P_{\theta_{12}^*} = G_e(P_1, P_2) \cap \mathrm{Bi}_m(P_1, P_2)$$

e-geodesic:
$$G_e(P_1, P_2) = \{ E_{12}^{(\lambda)} \mid \theta(E_{12}^{(\lambda)}) = (1 - \lambda)\theta_1 + \lambda\theta_2,\ \lambda \in [0, 1] \},$$

m-bisector:
$$\mathrm{Bi}_m(P_1, P_2) : \{ P \mid F(\theta_1) - F(\theta_2) + \eta(P)^\top \Delta\theta = 0 \},$$

Optimal natural parameter of $P^*$:
$$\theta^* = \theta_{12}^{(\alpha^*)} = \operatorname{argmin}_{\theta \in \Theta} B(\theta_1 : \theta) = \operatorname{argmin}_{\theta \in \Theta} B(\theta_2 : \theta).$$

→ closed form for order-1 families, or efficient bisection search.
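
The bisection search can be sketched for a one-parameter family. The code walks along the e-geodesic $\theta_\alpha = \alpha\theta_1 + (1-\alpha)\theta_2$ and looks for the point where $B(\theta_1 : \theta_\alpha) = B(\theta_2 : \theta_\alpha)$, which is the bisector equation when $\Delta\theta$ is read as $\theta_2 - \theta_1$ (an assumption, since the slide does not define it). The gap between the two divergences decreases strictly in $\alpha$, so bisection applies. The Poisson family, the tolerance, and all names are illustrative. For Poisson the bisector equation also gives the closed form $\lambda^* = (\lambda_1 - \lambda_2) / \log(\lambda_1 / \lambda_2)$, used here as a cross-check.

```python
import math

def F(theta):
    return math.exp(theta)        # Poisson log-normalizer (assumed family)

def grad_F(theta):
    return math.exp(theta)

def bregman(theta_a, theta_b):
    """B(theta_a : theta_b) = F(a) - F(b) - (a - b) * F'(b)."""
    return F(theta_a) - F(theta_b) - (theta_a - theta_b) * grad_F(theta_b)

def chernoff_alpha(theta1, theta2, tol=1e-12):
    """Bisection on alpha so that both Bregman divergences to theta_alpha coincide."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        theta = mid * theta1 + (1.0 - mid) * theta2
        gap = bregman(theta1, theta) - bregman(theta2, theta)
        if gap > 0:
            lo = mid   # theta_alpha still too close to theta2: increase alpha
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    lam1, lam2 = 2.0, 5.0
    t1, t2 = math.log(lam1), math.log(lam2)
    alpha = chernoff_alpha(t1, t2)
    theta_star = alpha * t1 + (1.0 - alpha) * t2
    print(alpha, math.exp(theta_star), (lam1 - lam2) / math.log(lam1 / lam2))
    print(bregman(t1, theta_star))  # Chernoff information C(P1 : P2)
```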

Geometry of the best error exponent: binary hypothesis

$$P^* = P_{\theta_{12}^*} = G_e(P_1, P_2) \cap \mathrm{Bi}_m(P_1, P_2)$$

[Figure, in the $\eta$-coordinate system: the e-geodesic $G_e(P_{\theta_1}, P_{\theta_2})$ joining $p_{\theta_1}$ and $p_{\theta_2}$ meets the m-bisector $\mathrm{Bi}_m(P_{\theta_1}, P_{\theta_2})$ at $p_{\theta_{12}^*}$, with $C(\theta_1 : \theta_2) = B(\theta_1 : \theta_{12}^*)$.]

Geometry of the best error exponent: multiple hypothesis

$n$-ary MHT [8] from the minimum pairwise Chernoff distance:
$$C(P_1, \ldots, P_n) = \min_{i,\, j \neq i} C(P_i, P_j)$$
$$P_e^m \le e^{-m\, C(P_{i^*}, P_{j^*})}, \quad (i^*, j^*) = \operatorname{argmin}_{i,\, j \neq i} C(P_i, P_j)$$

Compute, for each pair of natural neighbors [3] $P_{\theta_i}$ and $P_{\theta_j}$, the Chernoff distance $C(P_{\theta_i}, P_{\theta_j})$, and choose the pair with minimal distance. (Proof by contradiction using the Bregman Pythagoras theorem.)

→ Closest Bregman pair problem (the Chernoff distance fails the triangle inequality).
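
A brute-force version of the closest-pair computation is sketched below. It deliberately scans all pairs and does not restrict the search to natural neighbors as the slide proposes, so it is a baseline and not the method described above. It assumes Poisson hypotheses, for which the Chernoff information has the closed form $C = \lambda_1 - \lambda^* - \lambda^* \log(\lambda_1 / \lambda^*)$ with $\lambda^* = (\lambda_1 - \lambda_2)/\log(\lambda_1/\lambda_2)$. The names and the rates are illustrative.

```python
import math
from itertools import combinations

def chernoff_poisson(lam1, lam2):
    """Chernoff information between Poisson(lam1) and Poisson(lam2)."""
    if lam1 == lam2:
        return 0.0
    lam_star = (lam1 - lam2) / math.log(lam1 / lam2)  # rate of the Chernoff distribution
    return lam1 - lam_star - lam_star * math.log(lam1 / lam_star)

def closest_chernoff_pair(rates):
    """Return (C, (i, j)) minimizing the pairwise Chernoff information."""
    return min(
        (chernoff_poisson(rates[i], rates[j]), (i, j))
        for i, j in combinations(range(len(rates)), 2)
    )

if __name__ == "__main__":
    rates = [1.0, 4.0, 9.0, 10.0]   # assumed example hypotheses
    C, pair = closest_chernoff_pair(rates)
    print(pair, C)
    m = 100
    print(math.exp(-m * C))          # the bound exp(-m C) on P_e^m
```

The all-pairs scan costs a quadratic number of Chernoff evaluations.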

Hypothesis testing: Illustration

[Figure, in the $\eta$-coordinate system: Chernoff distribution between natural neighbours.]

Summary

Bayesian multiple hypothesis testing...
... from the viewpoint of computational geometry.

◮ probability of error & best MAP Bayesian rule
◮ total variation & $P_e$, upper-bounded by the Chernoff distance.
◮ Exponential family manifolds:
  ◮ MAP rule = NN classifier (additive Bregman Voronoi diagram)
  ◮ best error exponent from intersection geodesic/bisector for binary hypothesis,
  ◮ best error exponent from closest Bregman pair for multiple hypothesis.

Thank you

28th-30th August, Paris.

@incollection{HTIGCG-GSI-2013,
  year={2013},
  booktitle={Geometric Science of Information},
  volume={8085},
  series={Lecture Notes in Computer Science},
  editor={Frank Nielsen and Fr\'ed\'eric Barbaresco},
  title={Hypothesis testing, information divergence and computational geometry},
  publisher={Springer Berlin Heidelberg},
  author={Nielsen, Frank},
  pages={241-248}
}