Hypothesis testing, information divergence and computational geometry
Frank Nielsen, Sony Computer Science Laboratories, Inc. (August 2013, GSI, Paris, FR)


1. Hypothesis testing, information divergence and computational geometry
Frank Nielsen, Frank.Nielsen@acm.org, www.informationgeometry.org
Sony Computer Science Laboratories, Inc.
August 2013, GSI, Paris, FR
© 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.

2. The Multiple Hypothesis Testing (MHT) problem
Given a random variable X with n hypotheses H_1: X ∼ P_1, ..., H_n: X ∼ P_n, decide from an i.i.d. sample x_1, ..., x_m ∼ X which hypothesis holds true.
P_correct^m = 1 − P_error^m
Asymptotic regime (error exponent): α = −(1/m) log P_e^m, as m → ∞.
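A minimal Monte-Carlo sketch of this setup, assuming two hypothetical unit-variance Gaussian hypotheses N(0,1) and N(1,1), equal priors, and a most-likely-hypothesis decision; the sample size m and trial count are illustrative only:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mus, sigma, m, trials = [0.0, 1.0], 1.0, 10, 20000

errors = 0
for _ in range(trials):
    true_h = rng.integers(2)                    # draw the true hypothesis uniformly
    x = rng.normal(mus[true_h], sigma, size=m)  # i.i.d. sample x_1, ..., x_m
    loglik = [norm.logpdf(x, mu, sigma).sum() for mu in mus]
    if int(np.argmax(loglik)) != true_h:        # most-likely-hypothesis decision (equal priors)
        errors += 1

P_e = errors / trials
print("P_e^m ~", P_e, "  empirical exponent -(1/m) log P_e^m ~", -np.log(P_e) / m)
```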

3. Bayesian hypothesis testing (preliminaries)
Prior probabilities: w_i = Pr(X ∼ P_i) > 0, with ∑_{i=1}^n w_i = 1.
Conditional probabilities: Pr(X = x | X ∼ P_i).
Marginal: Pr(X = x) = ∑_{i=1}^n Pr(X ∼ P_i) Pr(X = x | X ∼ P_i) = ∑_{i=1}^n w_i Pr(X = x | P_i).
Let c_{i,j} = cost of deciding H_i when in fact H_j is true; the matrix [c_{ij}] is the cost design matrix.
Let p_{i,j}(u) = probability of making this decision using rule u.
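A small sketch of the marginal Pr(X = x) = ∑_i w_i Pr(X = x | X ∼ P_i), assuming two hypothetical 1D Gaussian conditional laws and illustrative priors:

```python
import numpy as np
from scipy.stats import norm

w = np.array([0.4, 0.6])              # prior probabilities w_i (sum to 1)
P = [norm(0.0, 1.0), norm(2.0, 1.0)]  # hypothetical conditional laws P_1, P_2

def marginal(x):
    """Pr(X = x) = sum_i w_i Pr(X = x | X ~ P_i)."""
    return sum(w_i * P_i.pdf(x) for w_i, P_i in zip(w, P))

print(marginal(1.0))
```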

4. Bayesian detector
Minimize the expected cost:
E_X[c(r(x))], where c(r(x)) = ∑_i w_i ∑_{j ≠ i} c_{i,j} p_{i,j}(r(x)).
Special case: the probability of error P_e is obtained for c_{i,i} = 0 and c_{i,j} = 1 for i ≠ j:
P_e = E_X[ ∑_i w_i ∑_{j ≠ i} p_{i,j}(r(x)) ].
The maximum a posteriori (MAP) rule classifies x as
MAP(x) = argmax_{i ∈ {1,...,n}} w_i p_i(x),
where p_i(x) = Pr(X = x | X ∼ P_i) are the conditional probabilities.
→ The MAP Bayesian detector minimizes P_e over all rules [8].
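A sketch of the MAP rule and of P_e under the 0/1 cost matrix, reusing the hypothetical two-Gaussian setup above; all parameter values are illustrative:

```python
import numpy as np
from scipy.stats import norm

w = np.array([0.4, 0.6])              # priors
P = [norm(0.0, 1.0), norm(2.0, 1.0)]  # hypothetical conditional densities p_1, p_2
means = np.array([0.0, 2.0])

def map_rule(x):
    """MAP(x) = argmax_i w_i p_i(x)."""
    return int(np.argmax([w_i * P_i.pdf(x) for w_i, P_i in zip(w, P)]))

# Monte-Carlo estimate of P_e (0/1 costs) under the MAP rule.
rng = np.random.default_rng(1)
labels = rng.choice(2, size=50_000, p=w)
samples = rng.normal(means[labels], 1.0)
decisions = np.array([map_rule(x) for x in samples])
print("P_e ~", np.mean(decisions != labels))
```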

5. Probability of error and divergences
Without loss of generality, consider equal priors (w_1 = w_2 = 1/2):
P_e = ∫_{x ∈ X} p(x) min(Pr(H_1 | x), Pr(H_2 | x)) dν(x)
(P_e > 0 as soon as supp p_1 ∩ supp p_2 ≠ ∅.)
From Bayes' rule, Pr(H_i | X = x) = Pr(H_i) Pr(X = x | H_i) / Pr(X = x) = w_i p_i(x) / p(x), so
P_e = (1/2) ∫_{x ∈ X} min(p_1(x), p_2(x)) dν(x).
Rewrite or bound P_e using two tricks of the trade:
Trick 1. ∀ a, b ∈ R: min(a, b) = (a + b)/2 − |a − b|/2.
Trick 2. ∀ a, b > 0: min(a, b) ≤ min_{α ∈ (0,1)} a^α b^{1−α}.
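A quick numerical check of the two identities on arbitrary example values of a and b:

```python
import numpy as np

a, b = 0.3, 0.8
# Trick 1: min(a, b) = (a + b)/2 - |a - b|/2
assert np.isclose(min(a, b), (a + b) / 2 - abs(a - b) / 2)
# Trick 2: min(a, b) <= a^alpha * b^(1 - alpha) for every alpha in (0, 1)
alphas = np.linspace(0.01, 0.99, 99)
assert min(a, b) <= (a ** alphas * b ** (1 - alphas)).min()
print("both identities hold for a =", a, "b =", b)
```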

6. Probability of error and total variation
P_e = (1/2) ∫_{x ∈ X} [ (p_1(x) + p_2(x))/2 − |p_1(x) − p_2(x)|/2 ] dν(x)
    = (1/2) [ 1 − (1/2) ∫_{x ∈ X} |p_1(x) − p_2(x)| dν(x) ]
P_e = (1/2) (1 − TV(P_1, P_2))
Total variation metric distance: TV(P, Q) = (1/2) ∫_{x ∈ X} |p(x) − q(x)| dν(x).
→ Difficult to compute when handling multivariate distributions.
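A numerical sketch of the identity P_e = (1/2)(1 − TV(P_1, P_2)) for equal priors, assuming the same hypothetical pair of 1D Gaussians:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

p1, p2 = norm(0.0, 1.0), norm(2.0, 1.0)   # hypothetical pair of densities

tv, _ = quad(lambda x: 0.5 * abs(p1.pdf(x) - p2.pdf(x)), -np.inf, np.inf)
pe, _ = quad(lambda x: 0.5 * min(p1.pdf(x), p2.pdf(x)), -np.inf, np.inf)
print("TV =", tv, "  P_e =", pe, "  (1/2)(1 - TV) =", 0.5 * (1 - tv))   # last two agree
```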

7. Bounding the probability of error P_e
Since min(a, b) ≤ min_{α ∈ (0,1)} a^α b^{1−α} for a, b > 0, upper bound P_e:
P_e = (1/2) ∫_{x ∈ X} min(p_1(x), p_2(x)) dν(x)
    ≤ (1/2) min_{α ∈ (0,1)} ∫_{x ∈ X} p_1^α(x) p_2^{1−α}(x) dν(x).
Chernoff information: C(P_1, P_2) = −log min_{α ∈ (0,1)} ∫_{x ∈ X} p_1^α(x) p_2^{1−α}(x) dν(x) ≥ 0.
At the best error exponent α* [7]:
P_e ≤ w_1^{α*} w_2^{1−α*} e^{−C(P_1, P_2)} ≤ e^{−C(P_1, P_2)}.
The bounding technique can be extended using any quasi-arithmetic α-means [13, 9].
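A sketch of computing C(P_1, P_2) by numerically minimizing the skewed Bhattacharyya integral over α, for the same hypothetical pair of Gaussians:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

p1, p2 = norm(0.0, 1.0), norm(2.0, 1.0)

def skew_bhattacharyya(alpha):
    """Integral of p_1^alpha(x) p_2^(1-alpha)(x) dx, minimized over alpha in (0,1)."""
    val, _ = quad(lambda x: p1.pdf(x) ** alpha * p2.pdf(x) ** (1 - alpha), -np.inf, np.inf)
    return val

res = minimize_scalar(skew_bhattacharyya, bounds=(1e-3, 1 - 1e-3), method="bounded")
print("alpha* ~", res.x, "  C(P1, P2) ~", -np.log(res.fun))
# For equal-variance Gaussians the closed form is (mu1 - mu2)^2 / (8 sigma^2) = 0.5, at alpha* = 1/2.
```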

8. Computational information geometry
Exponential family manifold [4]: M = { p_θ | p_θ(x) = exp(t(x)^⊤ θ − F(θ)) }.
Dually flat manifolds [1] enjoy dual affine connections: (M, ∇²F(θ), ∇^(e), ∇^(m)).
Dual coordinate systems: η = ∇F(θ), θ = ∇F*(η).
Canonical divergence from Young's inequality:
A(θ_1, η_2) = F(θ_1) + F*(η_2) − θ_1^⊤ η_2 ≥ 0,
with equality iff η_2 = ∇F(θ_1), in which case F(θ) + F*(η) = θ^⊤ η.
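A minimal sketch of the dual (Legendre) coordinates and canonical divergence, using the Bernoulli exponential family as a concrete example (the choice of family is illustrative, not from the slides):

```python
import numpy as np

# Bernoulli family: F(theta) = log(1 + e^theta), eta = grad F(theta) = sigmoid(theta),
# F*(eta) = eta log eta + (1 - eta) log(1 - eta).
F     = lambda th:  np.log1p(np.exp(th))
gradF = lambda th:  1.0 / (1.0 + np.exp(-th))
Fstar = lambda eta: eta * np.log(eta) + (1 - eta) * np.log(1 - eta)

def canonical_divergence(th1, eta2):
    """A(theta_1, eta_2) = F(theta_1) + F*(eta_2) - theta_1 * eta_2 >= 0 (Young's inequality)."""
    return F(th1) + Fstar(eta2) - th1 * eta2

th = 0.7
print(canonical_divergence(th, gradF(th)))  # ~ 0: equality exactly when eta_2 = grad F(theta_1)
print(canonical_divergence(th, 0.3))        # > 0 otherwise
```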

9. MAP decision rule and additive Bregman Voronoi diagrams
For exponential families, the Kullback-Leibler divergence amounts to a Bregman divergence:
KL(p_θ1 : p_θ2) = B(θ_2 : θ_1) = A(θ_2 : η_1) = A*(η_1 : θ_2) = B*(η_1 : η_2).
Canonical divergence (mixed primal/dual coordinates): A(θ_2 : η_1) = F(θ_2) + F*(η_1) − θ_2^⊤ η_1 ≥ 0.
Bregman divergence (single coordinate system, primal or dual): B(θ_2 : θ_1) = F(θ_2) − F(θ_1) − (θ_2 − θ_1)^⊤ ∇F(θ_1).
Density decomposition: log p_i(x) = −B*(t(x) : η_i) + F*(t(x)) + k(x), with η_i = ∇F(θ_i) = η(P_θi).
Optimal MAP decision rule:
MAP(x) = argmax_{i ∈ {1,...,n}} w_i p_i(x)
       = argmax_{i ∈ {1,...,n}} −B*(t(x) : η_i) + log w_i
       = argmin_{i ∈ {1,...,n}} B*(t(x) : η_i) − log w_i
→ a nearest-neighbor classifier [2, 10, 15, 16].
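A sketch of the MAP rule as an additively weighted Bregman nearest-neighbour query, assuming fixed-variance (σ = 1) univariate Gaussian hypotheses so that t(x) = x and F*(η) = η²/2; the parameter values are illustrative:

```python
import numpy as np

# For sigma = 1 Gaussians, B*(t(x) : eta_i) = (x - eta_i)^2 / 2, so the MAP rule is an
# additively weighted nearest-neighbour query in the expectation parameters eta_i.
Fstar     = lambda eta: 0.5 * eta ** 2
gradFstar = lambda eta: eta

def bregman_dual(t_x, eta):
    """B*(t(x) : eta) = F*(t(x)) - F*(eta) - (t(x) - eta) * grad F*(eta)."""
    return Fstar(t_x) - Fstar(eta) - (t_x - eta) * gradFstar(eta)

etas = np.array([-1.0, 0.5, 2.0])   # expectation parameters of the n hypotheses
w    = np.array([0.2, 0.5, 0.3])    # prior weights

def map_rule(x):
    scores = np.array([bregman_dual(x, e) for e in etas]) - np.log(w)
    return int(np.argmin(scores))   # argmin_i B*(t(x) : eta_i) - log w_i

print(map_rule(1.4))
```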

10. MAP & nearest neighbor classifier
Bregman Voronoi diagrams (with additive weights) are affine diagrams [2].
argmin_{i ∈ {1,...,n}} B*(t(x) : η_i) − log w_i
◮ point location in an arrangement [3] (small dimensions),
◮ divergence-based search trees [16],
◮ GPU brute force [6].

11. Geometry of the best error exponent: binary hypothesis
On the exponential family manifold, the Chernoff α-coefficient [5] is
c_α(P_θ1 : P_θ2) = ∫ p_θ1^α(x) p_θ2^{1−α}(x) dμ(x) = exp(−J_F^{(α)}(θ_1 : θ_2)),
where J_F^{(α)} is the skew Jensen divergence [14] on the natural parameters:
J_F^{(α)}(θ_1 : θ_2) = α F(θ_1) + (1 − α) F(θ_2) − F(θ_12^{(α)}).
Chernoff information = Bregman divergence for exponential families:
C(P_θ1 : P_θ2) = B(θ_1 : θ_12^{(α*)}) = B(θ_2 : θ_12^{(α*)}).
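A small sketch of the skew Jensen divergence on natural parameters, assuming θ_12^{(α)} is the convex combination αθ_1 + (1−α)θ_2 and taking the fixed-variance Gaussian log-normalizer F(θ) = θ²/2 as a toy family:

```python
import numpy as np

def skew_jensen(F, th1, th2, alpha):
    """J_F^(alpha)(theta_1 : theta_2) with theta_12^(alpha) = alpha*theta_1 + (1-alpha)*theta_2."""
    return alpha * F(th1) + (1 - alpha) * F(th2) - F(alpha * th1 + (1 - alpha) * th2)

F = lambda th: 0.5 * th ** 2                   # log-normalizer of sigma = 1 Gaussians
print(skew_jensen(F, 0.0, 2.0, 0.5))           # 0.5, the Chernoff information of N(0,1) vs N(2,1)
print(np.exp(-skew_jensen(F, 0.0, 2.0, 0.5)))  # the corresponding Chernoff 1/2-coefficient
```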

12. Geometry of the best error exponent: binary hypothesis
Chernoff distribution P* [12]: P* = P_{θ*_12} = G_e(P_1, P_2) ∩ Bi_m(P_1, P_2).
e-geodesic: G_e(P_1, P_2) = { E_12^{(λ)} | θ(E_12^{(λ)}) = (1 − λ)θ_1 + λθ_2, λ ∈ [0,1] }.
m-bisector: Bi_m(P_1, P_2) = { P | F(θ_1) − F(θ_2) + η(P)^⊤ Δθ = 0 }, with Δθ = θ_2 − θ_1.
Optimal natural parameter of P*: θ* = θ_12^{(α*)} = argmin_{θ: P_θ ∈ Bi_m(P_1,P_2)} B(θ_1 : θ) = argmin_{θ: P_θ ∈ Bi_m(P_1,P_2)} B(θ_2 : θ).
→ closed form for order-1 families, or an efficient bisection search otherwise.
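A sketch of the bisection-style search for α*, defined by B(θ_1 : θ_12^{(α)}) = B(θ_2 : θ_12^{(α)}) along the e-geodesic, again on the toy fixed-variance Gaussian family (function names and the root solver are illustrative choices):

```python
import numpy as np
from scipy.optimize import brentq

F, gradF = (lambda th: 0.5 * th ** 2), (lambda th: th)   # toy fixed-variance Gaussian family

def bregman(th_p, th_q):
    return F(th_p) - F(th_q) - (th_p - th_q) * gradF(th_q)

def chernoff_point(th1, th2):
    """Find alpha* where B(theta_1 : theta_alpha) = B(theta_2 : theta_alpha) on the e-geodesic."""
    theta = lambda a: a * th1 + (1 - a) * th2
    gap   = lambda a: bregman(th1, theta(a)) - bregman(th2, theta(a))
    a_star = brentq(gap, 1e-9, 1 - 1e-9)   # the gap changes sign exactly once on (0, 1)
    return a_star, theta(a_star)

a_star, th_star = chernoff_point(0.0, 2.0)
print(a_star, bregman(0.0, th_star))       # alpha* = 0.5 and C = B(theta_1 : theta*) = 0.5
```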

13. Geometry of the best error exponent: binary hypothesis
P* = P_{θ*_12} = G_e(P_1, P_2) ∩ Bi_m(P_1, P_2)
[Figure, in the η-coordinate system: the e-geodesic G_e(P_θ1, P_θ2) crosses the m-bisector Bi_m(P_θ1, P_θ2) at the Chernoff distribution P_{θ*_12}; C(θ_1 : θ_2) = B(θ_1 : θ*_12).]

14. Geometry of the best error exponent: multiple hypothesis
For n-ary MHT [8], the exponent is governed by the minimum pairwise Chernoff distance:
C(P_1, ..., P_n) = min_{i, j ≠ i} C(P_i, P_j),
P_e^m ≤ e^{−m C(P_i*, P_j*)}, with (i*, j*) = argmin_{i, j ≠ i} C(P_i, P_j).
Compute the Chernoff distance C(P_θi, P_θj) for each pair of natural neighbors [3] P_θi and P_θj, and choose the pair with minimal distance.
(Proof by contradiction using the Bregman Pythagorean theorem.)
→ A closest Bregman pair problem (the Chernoff distance fails the triangle inequality).
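A brute-force sketch of the n-ary rule, looping over all pairs rather than restricting to natural neighbors; the natural parameters and the toy fixed-variance Gaussian family are illustrative assumptions:

```python
import itertools
from scipy.optimize import brentq

F, gradF = (lambda th: 0.5 * th ** 2), (lambda th: th)   # same toy family as above
bregman = lambda p, q: F(p) - F(q) - (p - q) * gradF(q)

def chernoff_distance(ti, tj):
    theta = lambda a: a * ti + (1 - a) * tj
    a = brentq(lambda a: bregman(ti, theta(a)) - bregman(tj, theta(a)), 1e-9, 1 - 1e-9)
    return bregman(ti, theta(a))

thetas = [0.0, 0.8, 2.5, 4.0]     # hypothetical natural parameters of n = 4 hypotheses
i_star, j_star = min(itertools.combinations(range(len(thetas)), 2),
                     key=lambda ij: chernoff_distance(thetas[ij[0]], thetas[ij[1]]))
print((i_star, j_star), chernoff_distance(thetas[i_star], thetas[j_star]))  # closest Bregman pair
```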

15. Hypothesis testing: illustration
[Figure, in the η-coordinate system: Chernoff distributions between natural neighbours.]

16. Summary
Bayesian multiple hypothesis testing, from the viewpoint of computational geometry:
◮ probability of error & the best MAP Bayesian rule,
◮ P_e expressed via the total variation and upper-bounded via the Chernoff distance,
◮ on exponential family manifolds:
  ◮ MAP rule = nearest-neighbor classifier (additively weighted Bregman Voronoi diagram),
  ◮ best error exponent from the geodesic/bisector intersection for binary hypotheses,
  ◮ best error exponent from the closest Bregman pair for multiple hypotheses.

17. Thank you
GSI, 28th-30th August 2013, Paris.
@incollection{HTIGCG-GSI-2013,
  year      = {2013},
  booktitle = {Geometric Science of Information},
  volume    = {8085},
  series    = {Lecture Notes in Computer Science},
  editor    = {Frank Nielsen and Fr\'ed\'eric Barbaresco},
  title     = {Hypothesis testing, information divergence and computational geometry},
  publisher = {Springer Berlin Heidelberg},
  author    = {Nielsen, Frank},
  pages     = {241--248}
}
