SLIDE 1

Hypothesis testing, information divergence and computational geometry

Frank Nielsen Frank.Nielsen@acm.org www.informationgeometry.org

Sony Computer Science Laboratories, Inc.

August 2013, GSI, Paris, FR

© 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.

SLIDE 2

The Multiple Hypothesis Testing (MHT) problem

Given a random variable X with n hypotheses H_1 : X ∼ P_1, ..., H_n : X ∼ P_n, decide for an IID sample x_1, ..., x_m ∼ X which hypothesis holds true.

P^m_correct = 1 − P^m_e (probability of error)

Asymptotic regime: the error exponent α = −(1/m) log P^m_e, as m → ∞.

SLIDE 3

Bayesian hypothesis testing (preliminaries)

prior probabilities: w_i = Pr(X ∼ P_i) > 0 (with Σ_{i=1}^n w_i = 1)

conditional probabilities: Pr(X = x | X ∼ P_i).

Pr(X = x) = Σ_{i=1}^n Pr(X ∼ P_i) Pr(X = x | X ∼ P_i) = Σ_{i=1}^n w_i p_i(x)

Let c_{i,j} = cost of deciding H_i when in fact H_j is true. The matrix [c_{i,j}] is the cost design matrix.

Let p_{i,j}(u) = probability of making this decision using rule u.

SLIDE 4

Bayesian detector

Minimize the expected cost E_X[c(r(x))], with

c(r(x)) = Σ_i w_i Σ_{j≠i} c_{i,j} p_{i,j}(r(x))

Special case: the probability of error P_e is obtained for c_{i,i} = 0 and c_{i,j} = 1 for i ≠ j:

P_e = E_X[ Σ_i w_i Σ_{j≠i} p_{i,j}(r(x)) ]

The maximum a posteriori probability (MAP) rule classifies x as:

MAP(x) = argmax_{i∈{1,...,n}} w_i p_i(x)

where p_i(x) = Pr(X = x | X ∼ P_i) are the conditional probabilities.

→ The MAP Bayesian detector minimizes P_e over all decision rules [8].
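As a quick numerical sketch (not from the slides), the MAP rule on a finite sample space is just a weighted argmax; the helper name `map_decide` is hypothetical:

```python
import numpy as np

def map_decide(x, priors, pmfs):
    """MAP rule: argmax_i w_i p_i(x) for a sample x on a discrete alphabet."""
    scores = [w * p[x] for w, p in zip(priors, pmfs)]
    return int(np.argmax(scores))

# Two hypotheses on a 3-letter alphabet, equal priors w1 = w2 = 1/2.
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.1, 0.2, 0.7])
priors = [0.5, 0.5]
print(map_decide(0, priors, [p1, p2]))  # symbol 0 is far likelier under P1
print(map_decide(2, priors, [p1, p2]))  # symbol 2 is far likelier under P2
```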

SLIDE 5

Probability of error and divergences

Without loss of generality, consider equal priors (w_1 = w_2 = 1/2):

P_e = ∫_{x∈X} p(x) min(Pr(H_1|x), Pr(H_2|x)) dν(x)

(P_e > 0 as soon as supp p_1 ∩ supp p_2 ≠ ∅)

From Bayes' rule, Pr(H_i | X = x) = Pr(H_i) Pr(X = x | H_i) / Pr(X = x) = w_i p_i(x) / p(x), so

P_e = (1/2) ∫_{x∈X} min(p_1(x), p_2(x)) dν(x)

Rewrite or bound P_e using tricks of the trade:

Trick 1. ∀a, b ∈ R, min(a, b) = (a+b)/2 − |a−b|/2
Trick 2. ∀a, b > 0, min(a, b) ≤ min_{α∈(0,1)} a^α b^{1−α}

SLIDE 6

Probability of error and total variation

P_e = (1/2) ∫_{x∈X} ( (p_1(x) + p_2(x))/2 − |p_1(x) − p_2(x)|/2 ) dν(x)
    = (1/2) ( 1 − (1/2) ∫_{x∈X} |p_1(x) − p_2(x)| dν(x) )

P_e = (1/2) (1 − TV(P_1, P_2))

total variation metric distance: TV(P, Q) = (1/2) ∫_{x∈X} |p(x) − q(x)| dν(x)

→ Difficult to compute when handling multivariate distributions.
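The identity P_e = (1/2)(1 − TV) is easy to verify numerically on a discrete sample space; a minimal sketch with two hand-picked distributions:

```python
import numpy as np

# Two discrete distributions with overlapping support, equal priors w1 = w2 = 1/2.
p1 = np.array([0.6, 0.3, 0.1])
p2 = np.array([0.2, 0.3, 0.5])

# Bayes probability of error: Pe = (1/2) * sum_x min(p1(x), p2(x))
pe = 0.5 * np.minimum(p1, p2).sum()

# Total variation: TV = (1/2) * sum_x |p1(x) - p2(x)|
tv = 0.5 * np.abs(p1 - p2).sum()

# Identity from the slide: Pe = (1/2) * (1 - TV)
print(pe, tv)  # 0.3 and 0.4, and indeed 0.3 == 0.5 * (1 - 0.4)
```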

SLIDE 7

Bounding the probability of error P_e

Using min(a, b) ≤ min_{α∈(0,1)} a^α b^{1−α} for a, b > 0, upper bound P_e:

P_e = (1/2) ∫_{x∈X} min(p_1(x), p_2(x)) dν(x) ≤ (1/2) min_{α∈(0,1)} ∫_{x∈X} p_1^α(x) p_2^{1−α}(x) dν(x)

Chernoff information:

C(P_1, P_2) = − log min_{α∈(0,1)} ∫_{x∈X} p_1^α(x) p_2^{1−α}(x) dν(x) ≥ 0

Best error exponent α* [7]:

P_e ≤ w_1^{α*} w_2^{1−α*} e^{−C(P_1,P_2)} ≤ e^{−C(P_1,P_2)}

The bounding technique can be extended using any quasi-arithmetic α-means [13, 9]...
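On a finite alphabet the Chernoff information can be approximated by a grid search over α; a sketch (the grid resolution is an arbitrary choice) checking the bound P_e ≤ w_1^{α*} w_2^{1−α*} e^{−C} with equal priors:

```python
import numpy as np

p1 = np.array([0.6, 0.3, 0.1])
p2 = np.array([0.2, 0.3, 0.5])

# Chernoff alpha-coefficient c_a = sum_x p1^a p2^(1-a); C = -log min_a c_a.
alphas = np.linspace(0.01, 0.99, 999)
c_alpha = np.array([(p1**a * p2**(1 - a)).sum() for a in alphas])
C = -np.log(c_alpha.min())

pe = 0.5 * np.minimum(p1, p2).sum()
# With w1 = w2 = 1/2, w1^a w2^(1-a) = 1/2, so the slide's bound reads Pe <= (1/2) e^{-C}.
print(pe <= 0.5 * np.exp(-C))  # True
```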

SLIDE 8

Computational information geometry

Exponential family manifold [4]: M = {p_θ | p_θ(x) = exp(t(x)^⊤ θ − F(θ))}

Dually flat manifolds [1] enjoy dual affine connections: (M, ∇²F(θ), ∇^(e), ∇^(m)).

Dual coordinate systems: η = ∇F(θ), θ = ∇F*(η)

Canonical divergence from the Young inequality:

A(θ_1, η_2) = F(θ_1) + F*(η_2) − θ_1^⊤ η_2 ≥ 0

with equality for dual pairs: F(θ) + F*(η) = θ^⊤η when η = ∇F(θ).

SLIDE 9

MAP decision rule and additive Bregman Voronoi diagrams

KL(p_θ1 : p_θ2) = B(θ_2 : θ_1) = A(θ_2 : η_1) = A*(η_1 : θ_2) = B*(η_1 : η_2)

Canonical divergence (mixed primal/dual coordinates): A(θ_2 : η_1) = F(θ_2) + F*(η_1) − θ_2^⊤ η_1 ≥ 0

Bregman divergence (single coordinate system, primal or dual): B(θ_2 : θ_1) = F(θ_2) − F(θ_1) − (θ_2 − θ_1)^⊤ ∇F(θ_1)

log p_i(x) = −B*(t(x) : η_i) + F*(t(x)) + k(x), with η_i = ∇F(θ_i) = η(P_θi)

Optimal MAP decision rule:

MAP(x) = argmax_{i∈{1,...,n}} w_i p_i(x)
       = argmax_{i∈{1,...,n}} −B*(t(x) : η_i) + log w_i
       = argmin_{i∈{1,...,n}} B*(t(x) : η_i) − log w_i

→ nearest neighbor classifier [2, 10, 15, 16]
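The identity KL(p_θ1 : p_θ2) = B(θ_2 : θ_1) can be sanity-checked on a one-parameter family; a sketch using the Poisson family (natural parameter θ = log λ, log-normalizer F(θ) = e^θ), with the closed-form Poisson KL as the reference:

```python
import numpy as np

def bregman(F, gradF, t2, t1):
    """Bregman divergence B_F(t2 : t1) = F(t2) - F(t1) - (t2 - t1) * F'(t1)."""
    return F(t2) - F(t1) - (t2 - t1) * gradF(t1)

# Poisson family: theta = log(lambda), F(theta) = exp(theta), so grad F = exp too.
F, gradF = np.exp, np.exp

lam1, lam2 = 2.0, 5.0
t1, t2 = np.log(lam1), np.log(lam2)

# Closed-form KL between Poisson(lam1) and Poisson(lam2).
kl = lam1 * np.log(lam1 / lam2) + lam2 - lam1

# Slide identity: KL(p_theta1 : p_theta2) = B_F(theta2 : theta1).
print(np.isclose(kl, bregman(F, gradF, t2, t1)))  # True
```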

SLIDE 10

MAP & nearest neighbor classifier

Bregman Voronoi diagrams (with additive weights) are affine diagrams [2].

argmin_{i∈{1,...,n}} B*(t(x) : η_i) − log w_i

◮ point location in arrangements [3] (small dimensions),
◮ divergence-based search trees [16],
◮ GPU brute force [6].

SLIDE 11

Geometry of the best error exponent: binary hypothesis

On the exponential family manifold, the Chernoff α-coefficient [5]:

c_α(P_θ1 : P_θ2) = ∫ p_θ1^α(x) p_θ2^{1−α}(x) dμ(x) = exp(−J_F^{(α)}(θ_1 : θ_2))

Skew Jensen divergence [14] on the natural parameters (with θ_12^{(α)} = αθ_1 + (1−α)θ_2):

J_F^{(α)}(θ_1 : θ_2) = αF(θ_1) + (1−α)F(θ_2) − F(θ_12^{(α)})

Chernoff information = Bregman divergence for exponential families:

C(P_θ1 : P_θ2) = B(θ_1 : θ_12^{(α*)}) = B(θ_2 : θ_12^{(α*)})

SLIDE 12

Geometry of the best error exponent: binary hypothesis

Chernoff distribution P* [12]: P* = P_θ*12 = Ge(P_1, P_2) ∩ Bim(P_1, P_2)

e-geodesic: Ge(P_1, P_2) = {E_12^{(λ)} | θ(E_12^{(λ)}) = (1−λ)θ_1 + λθ_2, λ ∈ [0, 1]}

m-bisector: Bim(P_1, P_2) = {P | F(θ_1) − F(θ_2) + η(P)^⊤ Δθ = 0}, with Δθ = θ_2 − θ_1

Optimal natural parameter of P*:

θ* = θ_12^{(α*)} = argmin_{θ∈Θ} B(θ_1 : θ) = argmin_{θ∈Θ} B(θ_2 : θ)

→ closed form for order-1 families, or efficient bisection search.
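The bisection search mentioned above can be sketched for a one-parameter family: walk along the e-geodesic until B(θ_1 : θ) = B(θ_2 : θ), then read off the Chernoff information as that common Bregman divergence. A minimal sketch for the Poisson family (θ = log λ, F(θ) = e^θ); the function names are hypothetical:

```python
import numpy as np

# Univariate Poisson family: natural parameter theta = log(rate), F(theta) = exp(theta).
F, gradF = np.exp, np.exp

def B(ta, tb):
    """Bregman divergence B_F(ta : tb) = F(ta) - F(tb) - (ta - tb) F'(tb)."""
    return F(ta) - F(tb) - (ta - tb) * gradF(tb)

def chernoff_by_bisection(t1, t2, iters=60):
    """Bisect along the e-geodesic theta(l) = (1-l) t1 + l t2 for the point
    where B(t1 : theta) = B(t2 : theta); returns (C, l*)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        l = 0.5 * (lo + hi)
        t = (1 - l) * t1 + l * t2
        if B(t1, t) < B(t2, t):
            lo = l  # equality point lies further toward theta2
        else:
            hi = l
    l = 0.5 * (lo + hi)
    t = (1 - l) * t1 + l * t2
    return B(t1, t), l

C, lstar = chernoff_by_bisection(np.log(2.0), np.log(5.0))
print(C)  # Chernoff information C(Poisson(2), Poisson(5))
```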

SLIDE 13

Geometry of the best error exponent: binary hypothesis

P* = P_θ*12 = Ge(P_1, P_2) ∩ Bim(P_1, P_2)

[Figure: in the η-coordinate system, p_θ*12 lies at the intersection of the e-geodesic Ge(P_θ1, P_θ2) with the m-bisector Bim(P_θ1, P_θ2); C(θ_1 : θ_2) = B(θ_1 : θ*_12).]

SLIDE 14

Geometry of the best error exponent: multiple hypothesis

n-ary MHT [8] from the minimum pairwise Chernoff distance:

C(P_1, ..., P_n) = min_{i,j≠i} C(P_i, P_j)

P^m_e ≤ e^{−m C(P_i*, P_j*)}, with (i*, j*) = argmin_{i,j≠i} C(P_i, P_j)

Compute for each pair of natural neighbors [3] P_θi and P_θj the Chernoff distance C(P_θi, P_θj), and choose the pair with minimal distance. (Proof by contradiction using the Bregman Pythagoras theorem.)

→ Closest Bregman pair problem (the Chernoff distance fails the triangle inequality).
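A brute-force version of the closest-pair computation (quadratic, without the natural-neighbor pruning from the slide) can be sketched on discrete distributions, reusing a grid approximation of the Chernoff information:

```python
import numpy as np

def chernoff_info(p, q, alphas=np.linspace(0.01, 0.99, 199)):
    """C(P, Q) = -log min_a sum_x p^a q^(1-a), approximated on a grid of alphas."""
    return -np.log(min((p**a * q**(1 - a)).sum() for a in alphas))

# Three hypotheses; the overall error exponent is the *minimum* pairwise Chernoff distance.
ps = [np.array([0.7, 0.2, 0.1]),
      np.array([0.6, 0.3, 0.1]),
      np.array([0.1, 0.3, 0.6])]

pairs = [(i, j) for i in range(3) for j in range(i + 1, 3)]
best = min(pairs, key=lambda ij: chernoff_info(ps[ij[0]], ps[ij[1]]))
print(best)  # the closest pair dominates the error exponent
```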

SLIDE 15

Hypothesis testing: illustration

[Figure: η-coordinate system showing the Chernoff distributions between pairs of natural neighbours.]

SLIDE 16

Summary

Bayesian multiple hypothesis testing... from the viewpoint of computational geometry.

◮ probability of error & the best MAP Bayesian rule,
◮ total variation & P_e, upper-bounded by the Chernoff distance,
◮ on exponential family manifolds:
  ◮ MAP rule = NN classifier (additive Bregman Voronoi diagram),
  ◮ best error exponent from the geodesic/bisector intersection for binary hypotheses,
  ◮ best error exponent from the closest Bregman pair for multiple hypotheses.

SLIDE 17

Thank you

28th-30th August, Paris.

@incollection{HTIGCG-GSI-2013,
  year      = {2013},
  booktitle = {Geometric Science of Information},
  volume    = {8085},
  series    = {Lecture Notes in Computer Science},
  editor    = {Frank Nielsen and Fr\'ed\'eric Barbaresco},
  title     = {Hypothesis testing, information divergence and computational geometry},
  publisher = {Springer Berlin Heidelberg},
  author    = {Nielsen, Frank},
  pages     = {241--248}
}

SLIDE 18

Bibliographic references I

[1] Shun-ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry. Oxford University Press, 2000.
[2] Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock. Bregman Voronoi diagrams. Discrete & Computational Geometry, 44(2):281-307, 2010.
[3] Jean-Daniel Boissonnat and Mariette Yvinec. Algorithmic Geometry. Cambridge University Press, New York, NY, USA, 1998.
[4] Lawrence D. Brown. Fundamentals of Statistical Exponential Families: with Applications in Statistical Decision Theory. Institute of Mathematical Statistics, Hayward, CA, USA, 1986.
[5] Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23:493-507, 1952.
[6] Vincent Garcia, Eric Debreuve, Frank Nielsen, and Michel Barlaud. k-nearest neighbor search: Fast GPU-based implementations and application to high-dimensional feature matching. In IEEE International Conference on Image Processing (ICIP), pages 3757-3760, 2010.

SLIDE 19

Bibliographic references II

[7] Martin E. Hellman and Josef Raviv. Probability of error, equivocation and the Chernoff bound. IEEE Transactions on Information Theory, 16:368-372, 1970.
[8] C. C. Leang and D. H. Johnson. On the asymptotics of M-hypothesis Bayesian detection. IEEE Transactions on Information Theory, 43(1):280-282, January 1997.
[9] Frank Nielsen. Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. Submitted, 2012.
[10] Frank Nielsen. k-MLE: A fast algorithm for learning statistical mixture models. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2012. Preliminary technical report on arXiv.
[11] Frank Nielsen. Hypothesis testing, information divergence and computational geometry. In Frank Nielsen and Frédéric Barbaresco, editors, Geometric Science of Information, volume 8085 of Lecture Notes in Computer Science, pages 241-248. Springer Berlin Heidelberg, 2013.

SLIDE 20

Bibliographic references III

[12] Frank Nielsen. An information-geometric characterization of Chernoff information. IEEE Signal Processing Letters (SPL), 20(3):269-272, March 2013.
[13] Frank Nielsen. Pattern learning and recognition on statistical manifolds: An information-geometric review. In Edwin Hancock and Marcello Pelillo, editors, Similarity-Based Pattern Recognition, volume 7953 of Lecture Notes in Computer Science, pages 1-25. Springer Berlin Heidelberg, 2013.
[14] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455-5466, 2011.
[15] Frank Nielsen, Paolo Piro, and Michel Barlaud. Bregman vantage point trees for efficient nearest neighbor queries. In Proceedings of the 2009 IEEE International Conference on Multimedia and Expo (ICME), pages 878-881, 2009.
[16] Paolo Piro, Frank Nielsen, and Michel Barlaud. Tailored Bregman ball trees for effective nearest neighbors. In European Workshop on Computational Geometry (EuroCG), LORIA, Nancy, France, March 2009.