SLIDE 1

Hypothesis testing, information divergence and computational geometry

Frank Nielsen Frank.Nielsen@acm.org www.informationgeometry.org

Sony Computer Science Laboratories, Inc.

August 2013, GSI, Paris, FR

© 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.

SLIDE 2

The Multiple Hypothesis Testing (MHT) problem

Given a random variable X with n hypotheses H_1 : X ∼ P_1, ..., H_n : X ∼ P_n, decide for an IID sample x_1, ..., x_m ∼ X which hypothesis holds true.

P^m_correct = 1 − P^m_e (probability of error)

Asymptotic regime: the error exponent α = −(1/m) log P^m_e, as m → ∞.

SLIDE 3

Bayesian hypothesis testing (preliminaries)

prior probabilities: w_i = Pr(X ∼ P_i) > 0 (with Σ_{i=1}^n w_i = 1)

conditional probabilities: Pr(X = x | X ∼ P_i).

Pr(X = x) = Σ_{i=1}^n Pr(X ∼ P_i) Pr(X = x | X ∼ P_i) = Σ_{i=1}^n w_i p_i(x)

Let c_{i,j} = cost of deciding H_i when in fact H_j is true. The matrix [c_{i,j}] is the cost design matrix.

Let p_{i,j}(u) = probability of making this decision using rule u.

SLIDE 4

Bayesian detector

Minimize the expected cost E_X[c(r(x))], with

c(r(x)) = Σ_i w_i Σ_{j≠i} c_{i,j} p_{i,j}(r(x))

Special case: the probability of error P_e is obtained for c_{i,i} = 0 and c_{i,j} = 1 for i ≠ j:

P_e = E_X[ Σ_i w_i Σ_{j≠i} p_{i,j}(r(x)) ]

The maximum a posteriori probability (MAP) rule classifies x as:

MAP(x) = argmax_{i∈{1,...,n}} w_i p_i(x)

where p_i(x) = Pr(X = x | X ∼ P_i) are the conditional probabilities.

→ The MAP Bayesian detector minimizes P_e over all decision rules [8].
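As a quick numerical sketch (not from the slides), the MAP rule on a finite sample space is just a weighted argmax; the helper name `map_decide` is hypothetical:

```python
import numpy as np

def map_decide(x, priors, pmfs):
    """MAP rule: argmax_i w_i p_i(x) for a sample x on a discrete alphabet."""
    scores = [w * p[x] for w, p in zip(priors, pmfs)]
    return int(np.argmax(scores))

# Two hypotheses on a 3-letter alphabet, equal priors w1 = w2 = 1/2.
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.1, 0.2, 0.7])
priors = [0.5, 0.5]
print(map_decide(0, priors, [p1, p2]))  # symbol 0 is far likelier under P1
print(map_decide(2, priors, [p1, p2]))  # symbol 2 is far likelier under P2
```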

SLIDE 5

Probability of error and divergences

Without loss of generality, consider equal priors (w_1 = w_2 = 1/2):

P_e = ∫_{x∈X} p(x) min(Pr(H_1|x), Pr(H_2|x)) dν(x)

(P_e > 0 as soon as supp p_1 ∩ supp p_2 ≠ ∅)

From Bayes' rule, Pr(H_i | X = x) = Pr(H_i) Pr(X = x | H_i) / Pr(X = x) = w_i p_i(x) / p(x), so

P_e = (1/2) ∫_{x∈X} min(p_1(x), p_2(x)) dν(x)

Rewrite or bound P_e using tricks of the trade:

Trick 1. ∀a, b ∈ R, min(a, b) = (a+b)/2 − |a−b|/2
Trick 2. ∀a, b > 0, min(a, b) ≤ min_{α∈(0,1)} a^α b^{1−α}

SLIDE 6

Probability of error and total variation

P_e = (1/2) ∫_{x∈X} ( (p_1(x) + p_2(x))/2 − |p_1(x) − p_2(x)|/2 ) dν(x)
    = (1/2) ( 1 − (1/2) ∫_{x∈X} |p_1(x) − p_2(x)| dν(x) )

P_e = (1/2) (1 − TV(P_1, P_2))

total variation metric distance: TV(P, Q) = (1/2) ∫_{x∈X} |p(x) − q(x)| dν(x)

→ Difficult to compute when handling multivariate distributions.
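The identity P_e = (1/2)(1 − TV) is easy to verify numerically on a discrete sample space; a minimal sketch with two hand-picked distributions:

```python
import numpy as np

# Two discrete distributions with overlapping support, equal priors w1 = w2 = 1/2.
p1 = np.array([0.6, 0.3, 0.1])
p2 = np.array([0.2, 0.3, 0.5])

# Bayes probability of error: Pe = (1/2) * sum_x min(p1(x), p2(x))
pe = 0.5 * np.minimum(p1, p2).sum()

# Total variation: TV = (1/2) * sum_x |p1(x) - p2(x)|
tv = 0.5 * np.abs(p1 - p2).sum()

# Identity from the slide: Pe = (1/2) * (1 - TV)
print(pe, tv)  # 0.3 and 0.4, and indeed 0.3 == 0.5 * (1 - 0.4)
```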

SLIDE 7

Bounding the probability of error P_e

Using min(a, b) ≤ min_{α∈(0,1)} a^α b^{1−α} for a, b > 0, upper bound P_e:

P_e = (1/2) ∫_{x∈X} min(p_1(x), p_2(x)) dν(x) ≤ (1/2) min_{α∈(0,1)} ∫_{x∈X} p_1^α(x) p_2^{1−α}(x) dν(x)

Chernoff information:

C(P_1, P_2) = − log min_{α∈(0,1)} ∫_{x∈X} p_1^α(x) p_2^{1−α}(x) dν(x) ≥ 0

Best error exponent α* [7]:

P_e ≤ w_1^{α*} w_2^{1−α*} e^{−C(P_1,P_2)} ≤ e^{−C(P_1,P_2)}

The bounding technique can be extended using any quasi-arithmetic α-means [13, 9]...
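On a finite alphabet the Chernoff information can be approximated by a grid search over α; a sketch (the grid resolution is an arbitrary choice) checking the bound P_e ≤ w_1^{α*} w_2^{1−α*} e^{−C} with equal priors:

```python
import numpy as np

p1 = np.array([0.6, 0.3, 0.1])
p2 = np.array([0.2, 0.3, 0.5])

# Chernoff alpha-coefficient c_a = sum_x p1^a p2^(1-a); C = -log min_a c_a.
alphas = np.linspace(0.01, 0.99, 999)
c_alpha = np.array([(p1**a * p2**(1 - a)).sum() for a in alphas])
C = -np.log(c_alpha.min())

pe = 0.5 * np.minimum(p1, p2).sum()
# With w1 = w2 = 1/2, w1^a w2^(1-a) = 1/2, so the slide's bound reads Pe <= (1/2) e^{-C}.
print(pe <= 0.5 * np.exp(-C))  # True
```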

SLIDE 8

Computational information geometry

Exponential family manifold [4]: M = {p_θ | p_θ(x) = exp(t(x)^⊤ θ − F(θ))}

Dually flat manifolds [1] enjoy dual affine connections: (M, ∇²F(θ), ∇^(e), ∇^(m)).

Dual coordinate systems: η = ∇F(θ), θ = ∇F*(η)

Canonical divergence from the Young inequality:

A(θ_1, η_2) = F(θ_1) + F*(η_2) − θ_1^⊤ η_2 ≥ 0

with equality for dual pairs: F(θ) + F*(η) = θ^⊤η when η = ∇F(θ).

SLIDE 9

MAP decision rule and additive Bregman Voronoi diagrams

KL(p_θ1 : p_θ2) = B(θ_2 : θ_1) = A(θ_2 : η_1) = A*(η_1 : θ_2) = B*(η_1 : η_2)

Canonical divergence (mixed primal/dual coordinates): A(θ_2 : η_1) = F(θ_2) + F*(η_1) − θ_2^⊤ η_1 ≥ 0

Bregman divergence (single coordinate system, primal or dual): B(θ_2 : θ_1) = F(θ_2) − F(θ_1) − (θ_2 − θ_1)^⊤ ∇F(θ_1)

log p_i(x) = −B*(t(x) : η_i) + F*(t(x)) + k(x), with η_i = ∇F(θ_i) = η(P_θi)

Optimal MAP decision rule:

MAP(x) = argmax_{i∈{1,...,n}} w_i p_i(x)
       = argmax_{i∈{1,...,n}} −B*(t(x) : η_i) + log w_i
       = argmin_{i∈{1,...,n}} B*(t(x) : η_i) − log w_i

→ nearest neighbor classifier [2, 10, 15, 16]
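The identity KL(p_θ1 : p_θ2) = B(θ_2 : θ_1) can be sanity-checked on a one-parameter family; a sketch using the Poisson family (natural parameter θ = log λ, log-normalizer F(θ) = e^θ), with the closed-form Poisson KL as the reference:

```python
import numpy as np

def bregman(F, gradF, t2, t1):
    """Bregman divergence B_F(t2 : t1) = F(t2) - F(t1) - (t2 - t1) * F'(t1)."""
    return F(t2) - F(t1) - (t2 - t1) * gradF(t1)

# Poisson family: theta = log(lambda), F(theta) = exp(theta), so grad F = exp too.
F, gradF = np.exp, np.exp

lam1, lam2 = 2.0, 5.0
t1, t2 = np.log(lam1), np.log(lam2)

# Closed-form KL between Poisson(lam1) and Poisson(lam2).
kl = lam1 * np.log(lam1 / lam2) + lam2 - lam1

# Slide identity: KL(p_theta1 : p_theta2) = B_F(theta2 : theta1).
print(np.isclose(kl, bregman(F, gradF, t2, t1)))  # True
```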

SLIDE 10

MAP & nearest neighbor classifier

Bregman Voronoi diagrams (with additive weights) are affine diagrams [2].

argmin_{i∈{1,...,n}} B*(t(x) : η_i) − log w_i

◮ point location in arrangements [3] (small dimensions),
◮ divergence-based search trees [16],
◮ GPU brute force [6].

SLIDE 11

Geometry of the best error exponent: binary hypothesis

On the exponential family manifold, the Chernoff α-coefficient [5]:

c_α(P_θ1 : P_θ2) = ∫ p_θ1^α(x) p_θ2^{1−α}(x) dμ(x) = exp(−J_F^{(α)}(θ_1 : θ_2))

Skew Jensen divergence [14] on the natural parameters (with θ_12^{(α)} = αθ_1 + (1−α)θ_2):

J_F^{(α)}(θ_1 : θ_2) = αF(θ_1) + (1−α)F(θ_2) − F(θ_12^{(α)})

Chernoff information = Bregman divergence for exponential families:

C(P_θ1 : P_θ2) = B(θ_1 : θ_12^{(α*)}) = B(θ_2 : θ_12^{(α*)})

SLIDE 12

Geometry of the best error exponent: binary hypothesis

Chernoff distribution P* [12]: P* = P_θ*12 = Ge(P_1, P_2) ∩ Bim(P_1, P_2)

e-geodesic: Ge(P_1, P_2) = {E_12^{(λ)} | θ(E_12^{(λ)}) = (1−λ)θ_1 + λθ_2, λ ∈ [0, 1]}

m-bisector: Bim(P_1, P_2) = {P | F(θ_1) − F(θ_2) + η(P)^⊤ Δθ = 0}, with Δθ = θ_2 − θ_1

Optimal natural parameter of P*:

θ* = θ_12^{(α*)} = argmin_{θ∈Θ} B(θ_1 : θ) = argmin_{θ∈Θ} B(θ_2 : θ)

→ closed form for order-1 families, or efficient bisection search.
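The bisection search mentioned above can be sketched for a one-parameter family: walk along the e-geodesic until B(θ_1 : θ) = B(θ_2 : θ), then read off the Chernoff information as that common Bregman divergence. A minimal sketch for the Poisson family (θ = log λ, F(θ) = e^θ); the function names are hypothetical:

```python
import numpy as np

# Univariate Poisson family: natural parameter theta = log(rate), F(theta) = exp(theta).
F, gradF = np.exp, np.exp

def B(ta, tb):
    """Bregman divergence B_F(ta : tb) = F(ta) - F(tb) - (ta - tb) F'(tb)."""
    return F(ta) - F(tb) - (ta - tb) * gradF(tb)

def chernoff_by_bisection(t1, t2, iters=60):
    """Bisect along the e-geodesic theta(l) = (1-l) t1 + l t2 for the point
    where B(t1 : theta) = B(t2 : theta); returns (C, l*)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        l = 0.5 * (lo + hi)
        t = (1 - l) * t1 + l * t2
        if B(t1, t) < B(t2, t):
            lo = l  # equality point lies further toward theta2
        else:
            hi = l
    l = 0.5 * (lo + hi)
    t = (1 - l) * t1 + l * t2
    return B(t1, t), l

C, lstar = chernoff_by_bisection(np.log(2.0), np.log(5.0))
print(C)  # Chernoff information C(Poisson(2), Poisson(5))
```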

SLIDE 13

Geometry of the best error exponent: binary hypothesis

P* = P_θ*12 = Ge(P_1, P_2) ∩ Bim(P_1, P_2)

[Figure: in the η-coordinate system, p_θ*12 lies at the intersection of the e-geodesic Ge(P_θ1, P_θ2) with the m-bisector Bim(P_θ1, P_θ2); C(θ_1 : θ_2) = B(θ_1 : θ*_12).]

SLIDE 14

Geometry of the best error exponent: multiple hypothesis

n-ary MHT [8] from the minimum pairwise Chernoff distance:

C(P_1, ..., P_n) = min_{i,j≠i} C(P_i, P_j)

P^m_e ≤ e^{−m C(P_i*, P_j*)}, with (i*, j*) = argmin_{i,j≠i} C(P_i, P_j)

Compute for each pair of natural neighbors [3] P_θi and P_θj the Chernoff distance C(P_θi, P_θj), and choose the pair with minimal distance. (Proof by contradiction using the Bregman Pythagoras theorem.)

→ Closest Bregman pair problem (the Chernoff distance fails the triangle inequality).
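A brute-force version of the closest-pair computation (quadratic, without the natural-neighbor pruning from the slide) can be sketched on discrete distributions, reusing a grid approximation of the Chernoff information:

```python
import numpy as np

def chernoff_info(p, q, alphas=np.linspace(0.01, 0.99, 199)):
    """C(P, Q) = -log min_a sum_x p^a q^(1-a), approximated on a grid of alphas."""
    return -np.log(min((p**a * q**(1 - a)).sum() for a in alphas))

# Three hypotheses; the overall error exponent is the *minimum* pairwise Chernoff distance.
ps = [np.array([0.7, 0.2, 0.1]),
      np.array([0.6, 0.3, 0.1]),
      np.array([0.1, 0.3, 0.6])]

pairs = [(i, j) for i in range(3) for j in range(i + 1, 3)]
best = min(pairs, key=lambda ij: chernoff_info(ps[ij[0]], ps[ij[1]]))
print(best)  # the closest pair dominates the error exponent
```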

SLIDE 15

Hypothesis testing: illustration

[Figure: η-coordinate system showing the Chernoff distributions between pairs of natural neighbours.]

SLIDE 16

Summary

Bayesian multiple hypothesis testing... from the viewpoint of computational geometry.

◮ probability of error & the best MAP Bayesian rule,
◮ total variation & P_e, upper-bounded by the Chernoff distance,
◮ on exponential family manifolds:
  ◮ MAP rule = NN classifier (additive Bregman Voronoi diagram),
  ◮ best error exponent from the geodesic/bisector intersection for binary hypotheses,
  ◮ best error exponent from the closest Bregman pair for multiple hypotheses.

SLIDE 17

Thank you

28th-30th August, Paris.

@incollection{HTIGCG-GSI-2013,
  year      = {2013},
  booktitle = {Geometric Science of Information},
  volume    = {8085},
  series    = {Lecture Notes in Computer Science},
  editor    = {Frank Nielsen and Fr\'ed\'eric Barbaresco},
  title     = {Hypothesis testing, information divergence and computational geometry},
  publisher = {Springer Berlin Heidelberg},
  author    = {Nielsen, Frank},
  pages     = {241--248}
}

SLIDE 18

Bibliographic references I

[1] Shun-ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry. Oxford University Press, 2000.
[2] Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock. Bregman Voronoi diagrams. Discrete & Computational Geometry, 44(2):281-307, 2010.
[3] Jean-Daniel Boissonnat and Mariette Yvinec. Algorithmic Geometry. Cambridge University Press, New York, NY, USA, 1998.
[4] Lawrence D. Brown. Fundamentals of Statistical Exponential Families: with Applications in Statistical Decision Theory. Institute of Mathematical Statistics, Hayward, CA, USA, 1986.
[5] Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23:493-507, 1952.
[6] Vincent Garcia, Eric Debreuve, Frank Nielsen, and Michel Barlaud. k-nearest neighbor search: Fast GPU-based implementations and application to high-dimensional feature matching. In IEEE International Conference on Image Processing (ICIP), pages 3757-3760, 2010.

SLIDE 19

Bibliographic references II

[7] Martin E. Hellman and Josef Raviv. Probability of error, equivocation and the Chernoff bound. IEEE Transactions on Information Theory, 16:368-372, 1970.
[8] C. C. Leang and D. H. Johnson. On the asymptotics of M-hypothesis Bayesian detection. IEEE Transactions on Information Theory, 43(1):280-282, January 1997.
[9] Frank Nielsen. Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. Submitted, 2012.
[10] Frank Nielsen. k-MLE: A fast algorithm for learning statistical mixture models. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2012. Preliminary technical report on arXiv.
[11] Frank Nielsen. Hypothesis testing, information divergence and computational geometry. In Frank Nielsen and Frédéric Barbaresco, editors, Geometric Science of Information, volume 8085 of Lecture Notes in Computer Science, pages 241-248. Springer Berlin Heidelberg, 2013.

SLIDE 20

Bibliographic references III

[12] Frank Nielsen. An information-geometric characterization of Chernoff information. IEEE Signal Processing Letters (SPL), 20(3):269-272, March 2013.
[13] Frank Nielsen. Pattern learning and recognition on statistical manifolds: An information-geometric review. In Edwin Hancock and Marcello Pelillo, editors, Similarity-Based Pattern Recognition, volume 7953 of Lecture Notes in Computer Science, pages 1-25. Springer Berlin Heidelberg, 2013.
[14] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455-5466, 2011.
[15] Frank Nielsen, Paolo Piro, and Michel Barlaud. Bregman vantage point trees for efficient nearest neighbor queries. In Proceedings of the 2009 IEEE International Conference on Multimedia and Expo (ICME), pages 878-881, 2009.
[16] Paolo Piro, Frank Nielsen, and Michel Barlaud. Tailored Bregman ball trees for effective nearest neighbors. In European Workshop on Computational Geometry (EuroCG), LORIA, Nancy, France, March 2009.