Distances and information geometry: A computational viewpoint - PowerPoint PPT Presentation

Recent contributions to Distances and information geometry: A computational viewpoint
Frank Nielsen, Sony Computer Science Laboratories, Inc.
https://franknielsen.github.io/ 31st July 2020


SLIDE 1

Recent contributions to Distances and information geometry: A computational viewpoint

Frank Nielsen, Sony Computer Science Laboratories, Inc.

https://franknielsen.github.io/ 31st July 2020

SLIDE 2

Outline

  • 1. Siegel-Klein geometry (bounded complex matrix domains)

Hilbert geometry of the Siegel disk: The Siegel-Klein disk model https://arxiv.org/abs/2004.08160

  • 2. Information-geometric structures on the Cauchy manifold

On Voronoi Diagrams on the Information-Geometric Cauchy Manifolds Entropy 2020, 22(7), 713; https://doi.org/10.3390/e22070713 https://www.mdpi.com/1099-4300/22/7/713

  • 3. Generalizations of the Jensen-Shannon divergence & JS centroids

On the Jensen–Shannon Symmetrization of Distances Relying on Abstract Means, Entropy 2019, 21(5), 485; https://doi.org/10.3390/e21050485 https://www.mdpi.com/1099-4300/21/5/485
On a Generalization of the Jensen–Shannon Divergence and the Jensen–Shannon Centroid, Entropy 2020, 22(2), 221; https://doi.org/10.3390/e22020221 https://www.mdpi.com/1099-4300/22/2/221

SLIDE 3

Hilbert geometry of the Siegel disk: The Siegel-Klein disk model

Frank Nielsen Sony Computer Science Laboratories, Inc

https://franknielsen.github.io/ July 2020 https://arxiv.org/abs/2004.08160

SLIDE 4

Main standard models of hyperbolic geometry

Conformal Poincaré model vs. the lesser-known non-conformal Klein model. Hyperbolic Voronoi diagrams in 5 models.

Straight geodesics

https://www.youtube.com/watch?v=i9IUzNxeH4o&t=3s

Hyperbolic Voronoi diagram

Hyperbolic Voronoi diagrams made easy, IEEE ICCSA 2010

Metric tensor (Tissot indicatrix)

SLIDE 5

Siegel upper space

Birth of symplectic geometry (complex matrix groups, Siegel & Hua, 1940’s) Generalization of the Poincaré upper plane to complex matrix domains:

PD: Positive-definite cone

Infinitesimal length element. Geodesic length distance, with r_i the i-th real eigenvalue of the matrix cross-ratio.

Spectral decomposition

R: Not Hermitian, but all real eigenvalues!
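The slide's formulas did not survive extraction; as a hedged reconstruction of the standard expressions for the Siegel upper space (cf. the accompanying paper, https://arxiv.org/abs/2004.08160):

```latex
% Siegel upper space: Z = X + iY with X symmetric real and Y positive-definite
\mathrm{d}s^2_U(Z) \;=\; 2\,\mathrm{tr}\!\left( Y^{-1}\,\mathrm{d}Z\; Y^{-1}\,\mathrm{d}\bar{Z} \right)

% Geodesic length distance, with r_i the i-th (real) eigenvalue of R:
\rho_U(Z_1, Z_2) \;=\; \sqrt{ \sum_{i=1}^{d} \log^2 \frac{1 + \sqrt{r_i}}{1 - \sqrt{r_i}} }

% Matrix cross-ratio (not Hermitian, but with all-real eigenvalues):
R(Z_1, Z_2) \;=\; (Z_1 - Z_2)(Z_1 - \bar{Z}_2)^{-1}(\bar{Z}_1 - \bar{Z}_2)(\bar{Z}_1 - Z_2)^{-1}
```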

SLIDE 6

Siegel upper space: Generalizes the PD matrix cone and the Poincaré upper plane

When complex dimension is 1, recover the Poincaré upper plane

PD: Positive-definite cone

several equivalent formulas…

SLIDE 7

Generalized linear fractional transformations

Real symplectic group Sp(d,R): the Siegel upper space metric is invariant under generalized Moebius transformations called (biholomorphic) symplectic maps. Group inverse. The group action is transitive (→ homogeneous space).

(matrix group representation) (translation Z=A+iB)

SLIDE 8

Orientation-preserving isometry in the Siegel upper space

When the complex dimension is 1 (Poincaré upper plane), recover PSL(2,R). Stabilizer group of Z=iI: the symplectic orthogonal matrices (informally, they play the role of “rotations” in the Siegel geometry). Orientation-preserving isometry:


SLIDE 9

Siegel disk domain

A generalization of the Poincaré conformal disk. Disk domain (or equivalently):

Partial Loewner ordering

Spectral/operator norm:

(= maximum singular value ≥ 0)


Siegel disk domain: Shilov boundary; stratified space (by matrix rank)

SLIDE 10

Distance in the Siegel disk domain

Siegel disk distance: Siegel translation of W1 to the origin matrix 0 (= Siegel translation):

Costly to calculate because we need matrix square roots and inverses


When complex dimension is 1, recover the Poincaré disk metric:


Siegel metric in the disk domain:

SLIDE 11

Complex symplectic group (for the Siegel disk)

Orientation-preserving isometry in the Siegel disk; equivalent to PSL(2,C) in 1D

SLIDE 12

Conversions Siegel upper space <-> Siegel disk

Moebius transformations

(generalized linear fractional transformations)

SLIDE 13

Some applications of Siegel symplectic geometry

  • Radar signal processing:
  • Frédéric Barbaresco. Information geometry of covariance matrix: Cartan-Siegel homogeneous bounded domains, Mostow/Berger fibration and Fréchet median. In Matrix information geometry, pages 199-255. Springer, 2013.
  • Ben Jeuris and Raf Vandebril. The Kähler mean of block-Toeplitz matrices with Toeplitz structured blocks. SIAM Journal on Matrix Analysis and Applications, 37(3):1151-1175, 2016.
  • Congwen Liu and Jiajia Si. Positive Toeplitz operators on the Bergman spaces of the Siegel upper half-space. Communications in Mathematics and Statistics, pages 1-22, 2019.
  • Image processing:
  • Reiner Lenz. Siegel descriptors for image processing. IEEE Signal Processing Letters, 23(5):625-628, 2016.
  • Statistics:
  • Miquel Calvo and Josep M. Oller. A distance between elliptical distributions based in an embedding into the Siegel group. Journal of Computational and Applied Mathematics, 145(2):319-334, 2002.
  • Emmanuel Chevallier, Thibault Forget, Frédéric Barbaresco, and Jesús Angulo. Kernel density estimation on the Siegel space with an application to radar processing. Entropy, 18(11):396, 2016.

SLIDE 14

Poincaré conformal disk vs Klein non-conformal disk

  • Klein disk is non-conformal, with geodesics being straight Euclidean lines
  • Klein model is well-suited for computational geometry: e.g., Voronoi diagrams

Q: What is the equivalent of Klein geometry for the Siegel disk domain?

Clipped affine diagram (power diagram)

Hyperbolic Voronoi diagram

SLIDE 15

Hilbert (projective) geometry

Normed vector space; bounded open convex domain Ω. Define the Hilbert distance via the cross-ratio. Related to Birkhoff geometry on (d+1)-dimensional cones.
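The distance formula itself is missing from this extraction; its standard form, for a bounded open convex domain Ω, is (as I recall it):

```latex
% p̄, q̄: intersections of line (pq) with the boundary ∂Ω, with p̄ on the side of p
H_\Omega(p, q) \;=\; \kappa \left| \log \frac{\|\bar{q} - p\|\;\|\bar{p} - q\|}{\|\bar{q} - q\|\;\|\bar{p} - p\|} \right|, \qquad \kappa > 0,
\qquad H_\Omega(p, p) = 0.
```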

SLIDE 16

Rewriting the Hilbert distance

Or equivalently (p,q expressed from linear interpolations of boundary points):

SLIDE 17

Siegel-Klein disk model

In complex dimension 1, recover the Klein disk: choose the constant ½ to match the Klein disk geometry

SLIDE 18

Calculating the Siegel-Klein distance

Line passing through two matrix points:

Calculate the two α values on the Shilov boundary. In practice, perform a bisection search for the α values…

Siegel-Klein distance:
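A minimal sketch of the bisection scheme: parametrize the line through K0 and K1 as K(α) = K0 + α(K1 − K0), bisect for the two boundary values α⁻ ≤ 0 and α⁺ ≥ 1 where the operator norm hits 1, and plug them into the Hilbert cross-ratio with κ = ½. The function names and the tolerance are my own; only the method (bisection on the spectral norm) is from the slide.

```python
import numpy as np

def opnorm(M):
    # Operator (spectral) norm = largest singular value.
    return np.linalg.norm(M, 2)

def boundary_alpha(K0, K1, direction, tol=1e-12):
    # Bisection for the alpha with ||K0 + alpha (K1 - K0)||_O = 1 on the
    # Shilov boundary: direction=+1 gives alpha+ >= 1, direction=-1 gives alpha- <= 0.
    f = lambda a: opnorm(K0 + a * (K1 - K0)) - 1.0
    inside = 1.0 if direction > 0 else 0.0   # this endpoint lies inside the disk
    outside = inside + direction
    while f(outside) < 0.0:                  # step outward until we exit the disk
        outside += direction
    while abs(outside - inside) > tol:
        mid = 0.5 * (inside + outside)
        if f(mid) < 0.0:
            inside = mid
        else:
            outside = mid
    return 0.5 * (inside + outside)

def siegel_klein_distance(K0, K1, tol=1e-12):
    # Hilbert distance (kappa = 1/2) via the cross-ratio in the alpha parameter.
    if np.allclose(K0, K1):
        return 0.0
    am = boundary_alpha(K0, K1, -1.0, tol)
    ap = boundary_alpha(K0, K1, +1.0, tol)
    return 0.5 * np.log((ap * (1.0 - am)) / ((ap - 1.0) * (-am)))
```

In complex dimension 1 this reproduces the Klein disk distance, e.g. from the origin to 0.5 it returns arctanh(0.5) ≈ 0.5493.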

SLIDE 19

Siegel-Klein distance to the origin (zero matrix 0)

Solve for and

Special case I

Siegel disk distance:

Exact
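For this special case no bisection is needed: along the line α K the boundary is hit at α = ±1/‖K‖_O, and the Hilbert cross-ratio with κ = ½ collapses to a closed form in the operator norm. A small sketch (function name is mine):

```python
import numpy as np

def siegel_klein_dist_to_origin(K):
    # Exact Siegel-Klein distance from the zero matrix to K (requires ||K||_O < 1):
    # rho(0, K) = (1/2) log((1 + ||K||_O) / (1 - ||K||_O)) = arctanh(||K||_O),
    # which follows from the Hilbert distance with kappa = 1/2.
    r = np.linalg.norm(K, 2)  # operator norm = maximum singular value
    return 0.5 * np.log((1.0 + r) / (1.0 - r))
```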

SLIDE 20

Siegel-Klein distance: Line passing through the origin

Line (K1K2) passing through the origin:

Special case II

Siegel-Klein distance:

Exact

SLIDE 21

Siegel-Klein distance between diagonal matrices

Solve d quadratic systems to get the two α values:

Special case III

Siegel-Klein distance:

Exact

SLIDE 22

Approximating Hilbert geometry with nested domains

Enough to check in 1D:

SLIDE 23

Guaranteed approximation of the Siegel-Klein distance

SLIDE 24

Converting Siegel-Poincaré (W) to/from Siegel-Klein (K)

Radial contraction to the origin: Radial expansion to the origin:

Siegel-Klein -> Siegel-Poincaré; Siegel-Poincaré -> Siegel-Klein

SLIDE 25

Siegel-Klein geodesics are unique Euclidean straight lines

The main advantage of the Siegel-Klein model is that geodesics are straight, so many computational geometric techniques apply: for example, smallest enclosing balls, etc.

Follows from the definition of the Hilbert distance and the cross-ratio properties:

SLIDE 26

Geodesics in Hilbert geometry may not be unique

https://www.youtube.com/watch?v=Gz0Vjk5quQE

Hexagonal ball shapes

Geodesics in Cayley-Klein geometry are unique. (= Hilbert geometry for ellipsoidal domains)

Hilbert simplex geometry

(isometric to a normed space)

Clustering in Hilbert’s projective geometry: The case studies of the probability simplex and the elliptope of correlation matrices

Hilbert geometry of elliptope (space of correlation matrices)

https://franknielsen.github.io/elliptope/index.html

SLIDE 27

Summary of Siegel-Klein geometry:

  • Siegel and Hua studied in the 1940's the geometry of bounded complex matrix domains (= birth of symplectic geometry, not directly related to symplectic manifolds equipped with a closed non-degenerate 2-form)
  • The Siegel upper space generalizes the Poincaré upper plane, and the Siegel disk generalizes the Poincaré disk. The Siegel upper space further includes the cone of symmetric positive-definite (SPD) matrices on the imaginary i-axis
  • The orientation-preserving isometry group of the Siegel upper space is the projective real symplectic group: PSL(2,R) when the complex dimension is 1. The orientation-preserving isometry group of the Siegel disk is the projective complex symplectic group: PSL(2,C) when the complex dimension is 1.
  • Hilbert geometry on the Siegel disk ensures straight line geodesics. Well-suited to computational geometry in the Siegel-Klein disk (e.g., smallest enclosing ball)
  • The Siegel-Klein distance between two matrices can be calculated exactly when the line passing through the two matrices goes through the origin, or for diagonal matrices. Otherwise, guaranteed approximations of the Siegel-Klein distance by considering nested Hilbert geometries (require maximum singular values only).

https://arxiv.org/abs/2004.08160

SLIDE 28

Thank you!

https://arxiv.org/abs/2004.08160
Carl Ludwig Siegel (1896-1981), Hua Luogeng / Hua Loo-Keng 华罗庚 (1910-1985), Henri Poincaré (1854-1912), Felix Klein (1849-1925), David Hilbert (1862-1943)

SLIDE 29

Some references:

  • Carl Ludwig Siegel. Symplectic geometry. American Journal of Mathematics, 65(1):1-86, 1943.
  • Loo-Keng Hua. On the theory of automorphic functions of a matrix variable I: Geometrical basis. American Journal of Mathematics, 66(3):470-488, 1944.
  • Loo-Keng Hua. Geometries of matrices II: Study of involutions in the geometry of symmetric matrices. Transactions of the American Mathematical Society, 61(2):193-228, 1947.
  • Frédéric Barbaresco. Information geometry of covariance matrix: Cartan-Siegel homogeneous bounded domains, Mostow/Berger fibration and Fréchet median. In Matrix information geometry, pages 199-255. Springer, 2013.
  • Giovanni Bassanelli. On horospheres and holomorphic endomorfisms of the Siegel disc. Rendiconti del Seminario Matematico della Università di Padova, 70:147-165, 1983.
  • Pedro Jorge Freitas. On the action of the symplectic group on the Siegel upper half plane. PhD thesis, University of Illinois at Chicago, 1999.
  • Frank Nielsen and Ke Sun. Clustering in Hilbert's projective geometry: The case studies of the probability simplex and the elliptope of correlation matrices. Geometric Structures of Information, Springer, Cham, 2019, 297-331.

Siegel-Klein geometry: https://arxiv.org/abs/2004.08160

SLIDE 30

On Voronoi Diagrams on the Information-Geometric Cauchy Manifolds

Frank Nielsen Sony Computer Science Laboratories, Inc

https://franknielsen.github.io/ July 2020 On Voronoi Diagrams on the Information-Geometric Cauchy Manifolds Entropy 2020, 22(7), 713; https://doi.org/10.3390/e22070713 https://www.mdpi.com/1099-4300/22/7/713

SLIDE 31

Voronoi diagrams: Voronoi proximity cells

Given a finite point set, the Voronoi cell of a generator is defined via the Euclidean distance (norm-induced). The Voronoi diagram partitions the space into Voronoi cells.
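The partition can be sketched by nearest-generator labeling (function name is mine; this is the definition, not an efficient Voronoi construction):

```python
import numpy as np

def voronoi_labels(points, generators):
    # Assign each query point to its nearest generator (Euclidean distance);
    # the points with label i form (a sampling of) the Voronoi cell of generator i.
    P = np.asarray(points, float)[:, None, :]      # shape (m, 1, d)
    G = np.asarray(generators, float)[None, :, :]  # shape (1, n, d)
    return np.argmin(np.linalg.norm(P - G, axis=2), axis=1)
```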

SLIDE 32

Dual Voronoi structure is the Delaunay complex

Link adjacent Voronoi generators by a straight (geodesic) edge:

Voronoi Delaunay

The Delaunay complex yields the Delaunay triangulation when no d+2 points are cocircular: nice meshing properties

Dual orthogonal structures

SLIDE 33

Voronoi diagrams for asymmetric dissimilarities

Asymmetric (oriented) distance: the dual bisector is the primal bisector for the dual dissimilarity. Involution:

Dual Voronoi cells:

Dual distance:

SLIDE 34

Example: Bregman Voronoi diagrams

Bregman divergence for a strictly convex C2 generator F: recover the ordinary Euclidean Voronoi diagram when F is the squared Euclidean norm generator

Boissonnat, N, Nock. "Bregman Voronoi diagrams." Discrete & Computational Geometry 44.2 (2010): 281-307.

Three types of Voronoi diagrams: primal (curved), dual (always affine), symmetrized (curved)
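A minimal sketch of the Bregman divergence and the generator that recovers the Euclidean case (with F(x) = ½‖x‖², B_F is half the squared Euclidean distance, so the Bregman Voronoi diagram coincides with the ordinary one):

```python
import numpy as np

def bregman(F, gradF, x, y):
    # B_F(x : y) = F(x) - F(y) - <grad F(y), x - y>
    return F(x) - F(y) - np.dot(gradF(y), x - y)

# Squared-norm generator: Bregman divergence = (1/2) ||x - y||^2.
sqnorm = lambda x: 0.5 * np.dot(x, x)
grad_sqnorm = lambda x: x
```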

SLIDE 35

The Cauchy manifold

Manifold of the Cauchy distributions (Lorentzian distributions): location-scale family (l,s) with the standard Cauchy distribution as base. Several kinds of manifold information-geometric structures are induced by:

1. Fisher-Rao geometry: Fisher information metric (+ Levi-Civita metric connection) 2. α-geometry: Dualistic structure (Amari-Chentsov cubic tensor T), alpha connections 3. D-geometry: Dualistic geometry from divergence (e.g., Kullback-Leibler divergence) 4. Hessian geometry from Hessian metrics (smooth flat divergence + conformal flattening)

SLIDE 36

Cauchy manifold: Fisher-Rao Riemannian geometry

Fisher information matrix (FIM) yielding Fisher Riemannian metric (FIm): Fisher-Rao distance is a geodesic length and metric distance:

Scaled hyperbolic Poincaré upper plane metric

where

SLIDE 37

Cauchy manifold: Rao's distance

Fisher-Rao distance between Cauchy distributions: Extended to multidimensional “isotropic” location-scale families:
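A sketch of this distance, assuming the standard Cauchy Fisher information matrix diag(1/(2s²), 1/(2s²)): the Fisher metric is then half the Poincaré upper-plane metric, so the Fisher-Rao distance is 1/√2 times the hyperbolic distance between (l, s) points (function name is mine):

```python
import numpy as np

def fisher_rao_cauchy(l1, s1, l2, s2):
    # Hyperbolic distance in the Poincare upper half-plane between (l1,s1), (l2,s2).
    delta = ((l1 - l2) ** 2 + (s1 - s2) ** 2) / (2.0 * s1 * s2)
    d_hyp = np.arccosh(1.0 + delta)
    # Cauchy Fisher metric ds^2 = (dl^2 + ds^2) / (2 s^2) is half the hyperbolic
    # metric, so geodesic lengths scale by 1/sqrt(2).
    return d_hyp / np.sqrt(2.0)
```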

SLIDE 38

Cauchy manifold: Always curved self-dual structures!

Skewness cubic tensor (Amari-Chentsov totally symmetric tensor):

α-geometry:

All α-geometries coincide with the Fisher-Rao geometry for the Cauchy manifold:

Fisher-Rao geometry is 0-geometry :

Scalar curvature:

No way to choose α so that the α-geometry becomes dually flat

  • For the Gaussian distributions, we can choose α=1 or α=-1
  • For the t-Student distributions, we can choose:
SLIDE 39

Cauchy manifold: q-Gaussians for q=2

Tsallis' q-entropy. q-Gaussians are maximum entropy distributions wrt Tsallis' q-entropy. Related to Onicescu's informational energy and to Shannon entropy. Cauchy distributions are q-Gaussians for q=2: MaxEnt distributions for Tsallis' quadratic entropy.

SLIDE 40

Deformed q=2-exponential families

Deformed exponential function: Deformed reciprocal logarithm function: Deformed 2-exponential families (= Cauchy family): For Cauchy distributions, we find:

SLIDE 41

Cauchy 2-Gaussians: Canonical factorization

Natural parameters. Log-normalizer. Natural-to-ordinary parameter conversion. The gradient of the log-normalizer yields the dual coordinate system η.

SLIDE 42

Cauchy manifold: Dually flat manifold

Bregman divergence: called the Bregman-Tsallis q=2-divergence

SLIDE 43

Dual potential functions of the Hessian structure

Dual-to-primal conversion. Dual potential function. Dual-to-ordinary parameter conversion.

SLIDE 44

Dually flat divergence (= Bregman divergence)

with the Legendre-Fenchel divergence: (non-negativity from Young’s inequality)

SLIDE 45

Dual Hessians of the potential functions

Crouzeix identity: Hessian metrics are conformal to the Fisher information metric:

Dual Hessian metrics

SLIDE 46

Summary: Cauchy information-geometric structures

SLIDE 47

Invariant f-divergences and α-divergences

f-divergences: f convex, f(1)=0 Standard f-divergence: f’(1)=0, f’’(1)=1

  • Invariant because it satisfies information monotonicity, and
  • the infinitesimally small f-divergence is related to the Fisher information

α-divergences: Chernoff α-coefficient:

SLIDE 48

α-divergences are f-divergences

Kullback-Leibler divergence: (relative entropy) Kullback-Leibler divergence between Cauchy distributions is symmetric:

A closed-form formula for the Kullback-Leibler divergence between Cauchy distributions, arXiv:1905.10965
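The closed form from the cited paper (arXiv:1905.10965) is manifestly symmetric in the two parameter pairs; a small sketch (function name is mine):

```python
import numpy as np

def kl_cauchy(l1, s1, l2, s2):
    # KL divergence between Cauchy(l1, s1) and Cauchy(l2, s2), arXiv:1905.10965:
    # log(((s1 + s2)^2 + (l1 - l2)^2) / (4 s1 s2)) -- symmetric in (l1,s1)<->(l2,s2).
    return np.log(((s1 + s2) ** 2 + (l1 - l2) ** 2) / (4.0 * s1 * s2))
```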

SLIDE 49

Fisher-Rao distance and chi-squared divergences

Fisher-Rao distance is a metric distance

SLIDE 50

Square-root metrization of the KL divergence

The following function is a metric transform (and FR is metric distance):

SLIDE 51

Scale family case: Hilbertian metric distance

Hilbertian norm. Arithmetic mean; geometric mean; A-G inequality: A ≥ G

SLIDE 52

Cauchy hyperbolic Voronoi diagrams

Voronoi bisectors (dual bisectors coincide for symmetric distances): Voronoi bisectors are invariant under strictly monotonically increasing functions

SLIDE 53

Cauchy hyperbolic Voronoi diagrams

Poincaré conformal upper plane

SLIDE 54

Cauchy hyperbolic Voronoi diagrams

Several models of hyperbolic geometry:

  • 1. Poincaré conformal upper plane
  • 2. Poincaré conformal disk
  • 3. Klein non-conformal disk:
SLIDE 55

Cauchy hyperbolic Delaunay complex

Dual Delaunay complex obtained by geodesically linking adjacent Voronoi cells. Not necessarily a triangulation but a simplicial complex! Hyperbolic geometry is often used in ML for embedding hierarchical structures

SLIDE 56

Hyperbolic Delaunay edges are orthogonal to Voronoi bisectors

Orthogonality with respect to the Riemannian metric

SLIDE 57

Cauchy/Hyperbolic Voronoi diagrams

Poincaré upper plane Poincaré disk Klein disk

SLIDE 58

Hyperbolic Voronoi diagram with all unbounded Voronoi cells

Klein disk

SLIDE 59

Hyperbolic Delaunay complex: Empty-sphere property

Generalize the empty sphere property of the ordinary Voronoi diagram

Empty sphere: The ball passing through d+1 sites is empty of other sites

SLIDE 60

Dually flat Cauchy Voronoi diagrams

Primal bisector: coincides with the hyperbolic bisector. Dual bisector: coincides with the Euclidean bisector.

SLIDE 61

Summary of Cauchy Voronoi diagrams:
SLIDE 62

Summary: Information-geometric Cauchy manifolds

  • The α-geometries of the Cauchy manifolds all coincide, and yield a hyperbolic geometry of constant negative scalar curvature -2.

  • By using Tsallis’ quadratic entropy, we can realize Cauchy distributions (q-Gaussians

for q=2) as maximum entropy distributions.

  • The dual potential functions induced by deformed q=2 log-normalizer yields a

conformal flattening of the curved Fisher-Rao geometry where the Riemannian metric is a conformal metric of the Fisher information metric.

  • The Kullback-Leibler divergence between two Cauchy distributions is symmetric, and

its square root yields a metric distance. For scaled Cauchy distributions, the square root of the KLD is a Hilbertian metric.

  • The Cauchy Voronoi diagrams wrt to the chi-squared, KL, and Fisher-Rao distances

coincide with a hyperbolic Voronoi diagram. The dual Voronoi diagram for the flat divergence coincides with the Euclidean Voronoi diagram.

  • The hyperbolic Delaunay complex is orthogonal to the hyperbolic Voronoi diagram,

and is often not a triangulation, hence its name hyperbolic Delaunay complex.

SLIDE 63

On a Generalization of the Jensen–Shannon Divergence and the Jensen–Shannon Centroid

Frank Nielsen Sony Computer Science Laboratories, Inc

https://franknielsen.github.io/

On a Generalization of the Jensen–Shannon Divergence and the Jensen–Shannon Centroid Entropy 2020, 22(2), 221; https://doi.org/10.3390/e22020221 https://www.mdpi.com/1099-4300/22/2/221

SLIDE 64

The Jensen-Shannon divergence in a nutshell

(The KLD requires the same support; the JSD does not require the same support.)

Kullback-Leibler divergence (asymmetric, unbounded). Jensen-Shannon divergence (symmetric, bounded). Shannon entropy. The square root of the JSD yields a Hilbertian metric space. JSD (capacitory discrimination) = total KL divergence to the average distribution.
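A minimal sketch on discrete distributions (function names are mine): the JSD averages the two KL divergences to the midpoint mixture, which is why it stays finite, and bounded by log 2, even for disjoint supports.

```python
import numpy as np

def kl(p, q):
    # KL divergence on discrete distributions; terms with p_i = 0 contribute 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    # JSD(p, q) = (1/2) KL(p : m) + (1/2) KL(q : m) with m = (p + q) / 2.
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

For fully disjoint supports, e.g. p = (1, 0) and q = (0, 1), the JSD attains its maximum log 2 while KL(p : q) is infinite.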

SLIDE 65

The extended Jensen-Shannon divergence

Extended Kullback-Leibler divergence to positive measures: Extended Jensen-Shannon divergence to positive measures: Extended Jensen-Shannon divergence upper bounded by

SLIDE 66

Skewed Jensen-Shannon divergences

Notation for statistical mixture: Skewed Jensen-Shannon divergence for By introducing the skewed Kullback-Leibler divergence: Symmetric skewed Jensen-Shannon divergence: … and we recover the JSD for ½:

SLIDE 67

Jensen-Shannon divergences are f-divergences

f-divergences for convex generator f, strictly convex at 1 with f(1)=0

(standard when f’(1)=0, f’’(1)=1)

f-divergences satisfy information monotonicity (= data processing inequality)

coarse binning, lumping

f-divergences upper bounded by

Skewed Jensen-Shannon divergences are f-divergences for the generator:

SLIDE 68

Extending Jensen-Shannon divergences: Vector-skewed Jensen–Bregman divergences

Bregman divergence:

Skewing vector : Weight vector belongs to (standard k-simplex) Notation for linear interpolation:

Vector-skewed α-Jensen–Bregman divergence (α-JBD):

SLIDE 69

Rewriting the vector-skewed Jensen–Bregman divergences

We have: Therefore Rewrites as The inner product vanishes when we choose And we get the vector-skew α-JBD: Notation:

SLIDE 70

Vector-skew Jensen–Shannon divergences

Invariant information-monotone divergences Nice for optimization

SLIDE 71

Properties of the vector-skew JS divergences

SLIDE 72

Jensen–Shannon centroids on mixture families

Mixture family in information geometry (w-mixtures). Example: the family of categorical distributions is a mixture family. The Kullback-Leibler divergence between two mixture distributions amounts to a Bregman divergence for the negentropy generator:

SLIDE 73

Jensen–Shannon centroids

Like the Fréchet mean, we define the Jensen-Shannon centroid as the minimizer(s) of This defines a Difference of Convex (DC) program: With convex functions:

SLIDE 74

Jensen–Shannon centroids: CCCP

Convex-ConCave Procedure (CCCP) is a step-size-free optimization for smooth DC programs:

  • Initialize arbitrarily (e.g., at the centroid)
  • Iteratively update:
SLIDE 75

Visualization of the CCCP

Interpretation: support hyperplanes to the graph of A shall be parallel to the graph of B

SLIDE 76

Jensen-Shannon centroid for categorical distributions

Shannon neg-entropy is a strictly convex and differentiable Bregman generator: Mixture family (mixture of mixtures is a mixture):

SLIDE 77

Jensen-Shannon centroid: Implementing CCCP

Initialize: Iterate:
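A minimal sketch of the CCCP iteration as I understand it from the slides, assuming the update θ ← (∇F)⁻¹( (1/n) Σᵢ ∇F((θᵢ + θ)/2) ) with F the Shannon negentropy generator of the categorical mixture family; the function names and the stopping rule (fixed iteration count) are mine:

```python
import numpy as np

def grad_negent(p):
    # Gradient of the Shannon negentropy generator on the mixture coordinates:
    # eta_j = log(p_j / p_last) for the first d-1 components.
    return np.log(p[:-1] / p[-1])

def inv_grad(eta):
    # Inverse gradient (softmax-like): recover a probability vector from eta.
    e = np.exp(eta)
    last = 1.0 / (1.0 + e.sum())
    return np.append(last * e, last)

def js_centroid(ps, iters=100):
    # CCCP updates: theta <- (grad F)^{-1}( mean_i grad F((theta_i + theta)/2) ).
    ps = [np.asarray(p, float) for p in ps]
    c = np.mean(ps, axis=0)  # initialize at the arithmetic centroid
    for _ in range(iters):
        etas = [grad_negent(0.5 * (p + c)) for p in ps]
        c = inv_grad(np.mean(etas, axis=0))
    return c
```

By symmetry, the centroid of {(0.8, 0.2), (0.2, 0.8)} stays at the uniform distribution, and the centroid of identical inputs is that input.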

SLIDE 78

Experiments:

Jeffreys centroid (grey histogram) Jensen–Shannon centroid (black histogram) Lena image (red histogram) Barbara image (blue histogram)

Close to zero in [0,20]

SLIDE 79

negative image histogram Barbara histogram

SLIDE 80

JSD always bounded even on different supports

SLIDE 81

Summary: Vector-skewed Jensen-Shannon divergence

  • The Jensen-Shannon divergence is a bounded symmetrization of the Kullback-Leibler divergence (KLD) which allows one to measure the distance between distributions with potentially different supports (useful in ML, e.g., GANs)

  • The Jensen-Shannon divergence is an f-divergence which satisfies the data processing inequality

  • Generalize the weighted skewed Jensen-Shannon divergence by using a skew vector parameter
  • The vector-skewed Jensen-Shannon divergence is an information-monotone f-divergence
  • The (vector-skewed) Jensen-Shannon centroids can be modeled using a smooth Difference of Convex (DC) program and solved using the Convex-ConCave Procedure (CCCP)

https://www.mdpi.com/1099-4300/22/2/221

SLIDE 82

On the Jensen–Shannon Symmetrization of Distances Relying on Abstract Means

Frank Nielsen Sony Computer Science Laboratories, Inc

https://franknielsen.github.io/

On the Jensen–Shannon Symmetrization of Distances Relying on Abstract Means Entropy 2019, 21(5), 485; https://doi.org/10.3390/e21050485 https://www.mdpi.com/1099-4300/21/5/485 Code: https://franknielsen.github.io/M-JS/

SLIDE 83

Unbounded Kullback-Leibler divergence (KLD)

Also called relative entropy: Cross-entropy: Shannon’s entropy: (self cross-entropy) Reverse KLD:

(KLD=forward KLD)

SLIDE 84

Symmetrizations of the Kullback-Leibler divergence

Jeffreys' divergence (twice the arithmetic mean of the oriented KLDs). Resistor average divergence (harmonic mean of forward and reverse KLD). Question: what is the role of the mean in the symmetrization, and how can it be extended?

SLIDE 85

Bounded Jensen-Shannon divergence (JSD)

(Shannon entropy h is strictly concave, so JSD ≥ 0.) The JSD is bounded. Proof: The square root of the JSD is a metric distance (moreover Hilbertian).

(Does not require the same support)

SLIDE 86

Invariant f-divergences, symmetrized f-divergences

Convex generator f, strictly convex at 1 with f(1)=0 (standard when f’(1)=0, f’’(1)=1) f-divergences are said invariant in information geometry because they satisfy coarse-graining (data processing inequality)

f-divergences can always be symmetrized: Reverse f-divergence for

Jeffreys f-generator: Jensen-Shannon f-generator:

SLIDE 87

Statistical distances vs parameter vector distances

A statistical distance D between two parametric distributions of a same family (e.g., the Gaussian family) amounts to a parameter distance P:

For example, the KLD between two densities of a same exponential family amounts to a reverse Bregman divergence for the Bregman cumulant generator: From a smooth C3 parameter distance (= contrast function), we can build a dualistic information-geometric structure

SLIDE 88

Skewed Jensen-Bregman divergences

JS-kind symmetrization of the parameter Bregman divergence:

Notation for the linear interpolation:

SLIDE 89

J-symmetrization and JS-symmetrization

J-symmetrization of a statistical/parameter distance D: JS-symmetrization of a statistical/parameter distance D: Example: J-symmetrization and JS-symmetrization of f-divergences:

Conjugate f-generator:

SLIDE 90

Generalized Jensen-Shannon divergences: Role of abstract weighted means, generalized mixtures

Quasi-arithmetic weighted means for a strictly increasing function h:

When M=A arithmetic mean, normalizer Z is 1

SLIDE 91

Definitions: M-JSD and M-JS symmetrizations

Definition extended for generic distance D (not necessarily KLD):

SLIDE 92

Generic definition: (M,N)-JS symmetrization

Consider two abstract means M and N (e.g., N harmonic as in the resistor average distortion). The main advantage of the (M,N)-JSD is to get closed-form formulas for distributions belonging to given parametric families by carefully choosing the M-mean: for example, the geometric mean for exponential families, or the harmonic mean for Cauchy or t-Student families, etc.
SLIDE 93

(A,G)-Jensen-Shannon divergence for exponential families

Exponential family: Natural parameter space: Geometric statistical mixture: Normalization coefficient: Jensen parameter divergence:
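The formulas for the geometric mixture and its normalizer are missing from the extraction; a hedged sketch of the standard relations the labels refer to: a weighted geometric mixture of two exponential-family densities stays in the family, with natural parameter the interpolation of the two, and the normalization coefficient is governed by the Jensen parameter divergence of the cumulant function F:

```latex
(p_{\theta_1}\, p_{\theta_2})^{G}_{\alpha}(x)
  \;=\; \frac{p_{\theta_1}(x)^{1-\alpha}\, p_{\theta_2}(x)^{\alpha}}{Z_{\alpha}(\theta_1, \theta_2)}
  \;=\; p_{(1-\alpha)\theta_1 + \alpha\theta_2}(x),

Z_{\alpha}(\theta_1, \theta_2) \;=\; \exp\!\big( -J_{F,\alpha}(\theta_1, \theta_2) \big),

J_{F,\alpha}(\theta_1, \theta_2) \;=\; (1-\alpha) F(\theta_1) + \alpha F(\theta_2)
  - F\big( (1-\alpha)\theta_1 + \alpha\theta_2 \big).
```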

SLIDE 94

(A,G)-Jensen-Shannon divergence for exponential families

Closed-form formula for the KLD between two geometric mixtures in terms of a Bregman divergence between interpolated parameters:

SLIDE 95

Example: Multivariate Gaussian exponential family

Family of Normal distributions: Cumulant function/log-normalizer: Sufficient statistics: Canonical factorization:

SLIDE 96

Example: Multivariate Gaussian exponential family

Dual moment parameterization: Conversions between ordinary/natural/expectation parameters: Dual potential function (=negative differential Shannon entropy):

SLIDE 97

SLIDE 98

More examples: Abstract means and M-mixtures

https://www.mdpi.com/1099-4300/21/5/485

SLIDE 99

Summary: Generalized Jensen-Shannon divergences

  • The Jensen-Shannon divergence (JSD) is a bounded symmetrization of the Kullback-Leibler divergence (KLD). The Jeffreys divergence (JD) is an unbounded symmetrization of the KLD. Both JSD and JD are invariant f-divergences.
  • Although the KLD and JD between Gaussians (or densities of a same exponential family) admit closed-form formulas, the JSD between Gaussians does not have a closed-form expression, and these distances need to be approximated in applications (machine learning, e.g., GANs in deep learning).
  • The skewed Jensen-Shannon divergence is based on statistical arithmetic mixtures. We define generic statistical M-mixtures based on an abstract mean, and define accordingly the M-Jensen-Shannon divergence, and further the (M,N)-JSD.
  • When M=G is the geometric weighted mean, we obtain a closed-form formula for the G-Jensen-Shannon divergence between Gaussian distributions. Applications to machine learning (e.g., deep learning GANs).

Code: https://franknielsen.github.io/M-JS/ https://arxiv.org/abs/2006.10599