

SLIDE 1

Geometric losses for distributional learning

Arthur Mensch (1), Mathieu Blondel (2), Gabriel Peyré (1)

(1) École Normale Supérieure, DMA; Centre National pour la Recherche Scientifique, Paris, France
(2) NTT Communication Science Laboratories, Kyoto, Japan

June 12, 2019

SLIDE 2

1. Introduction
2. Fenchel-Young losses for distribution spaces
3. Geometric softmax from Sinkhorn negentropies
4. Applications

SLIDES 3-8

Introduction: Predicting distributions

SLIDES 9-11

Contribution: losses and links for continuous metrized outputs

Handling output geometry:
- Link and loss with a cost between classes C : Y × Y → R
- Output distributions over a continuous space Y

New geometric losses and associated link functions:
1. Construction from the duality between distributions and scores
2. Needed: a convex functional on the distribution space, provided by regularized optimal transport

SLIDES 12-13

Background: learning with a cost over outputs Y

Cost augmentation of losses [1, 2]: convex cost-aware loss L_C : [1, d] × R^d → R
⚠ Undefined link function R^d → △^d: what should we predict at test time?

Use a Wasserstein distance between output distributions [3]:
- The ground metric C defines a distance W_C between distributions
- Prediction with a softmax link: ℓ(α, f) = W_C(softmax(f), α)
⚠ Non-convex loss, costly to compute

[1] Ioannis Tsochantaridis et al. "Large margin methods for structured and interdependent output variables". In: JMLR (2005).
[2] Kevin Gimpel and Noah A. Smith. "Softmax-margin CRFs: Training log-linear models with cost functions". In: NAACL. 2010.
[3] Charlie Frogner et al. "Learning with a Wasserstein loss". In: NIPS. 2015.
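To make this baseline concrete, here is a minimal log-domain Sinkhorn sketch (NumPy; the function name, the fixed iteration budget, and the ε value are my own choices, not the paper's): it approximates W_C(softmax(f), α) on a discrete Y, illustrating both the softmax link and the inner iterative solver that makes this loss costly.

```python
import numpy as np
from scipy.special import logsumexp, softmax

def sinkhorn_distance(a, b, C, eps=0.1, n_iter=500):
    """Entropic approximation of W_C(a, b) by alternating log-domain dual updates."""
    log_a, log_b = np.log(a), np.log(b)
    f, g = np.zeros_like(a), np.zeros_like(b)
    for _ in range(n_iter):
        f = -eps * logsumexp(log_b + (g - C) / eps, axis=1)    # f_i = −ε log Σ_j b_j e^{(g_j − C_ij)/ε}
        g = -eps * logsumexp(log_a + (f - C.T) / eps, axis=1)  # symmetric update for g
    # Recover the regularized transport plan and evaluate its cost.
    pi = np.exp(log_a[:, None] + log_b[None, :] + (f[:, None] + g[None, :] - C) / eps)
    return float(np.sum(pi * C))

C = np.abs(np.arange(3)[:, None] - np.arange(3)[None, :]).astype(float)  # ordinal ground cost
scores, alpha = np.array([2.0, 0.0, -1.0]), np.array([0.2, 0.5, 0.3])
loss = sinkhorn_distance(softmax(scores), alpha, C)  # W_C(softmax(f), α): non-convex in the scores
print(loss)
```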

SLIDE 14

1. Introduction
2. Fenchel-Young losses for distribution spaces
3. Geometric softmax from Sinkhorn negentropies
4. Applications

SLIDES 15-20

Predicting distributions from topological duality


SLIDES 21-24

All you need is a convex functional

Fenchel-Young losses [4, 5]: a convex function Ω : △^d → R and its conjugate

  Ω*(f) = max_{α ∈ △^d} ⟨α, f⟩ − Ω(α)

  ℓ_Ω(α, f) = Ω(α) + Ω*(f) − ⟨α, f⟩ ≥ 0

Define link functions between dual and primal:

  ∇Ω(α) = argmin_{f ∈ R^d} ℓ_Ω(α, f)        ∇Ω*(f) = argmin_{α ∈ △^d} ℓ_Ω(α, f)

[4] John C. Duchi et al. "Multiclass Classification, Information, Divergence, and Surrogate Risk". In: Annals of Statistics (2018).
[5] Mathieu Blondel et al. "Learning Classifiers with Fenchel-Young Losses: Generalized Entropies, Margins, and Algorithms". In: AISTATS. 2019.

SLIDES 25-27

Discrete canonical example: Shannon entropy

  Ω(α) = −H(α) = Σ_{i=1}^d α_i log α_i        Ω*(f) = logsumexp(f)

Not defined on continuous distributions, and cost-agnostic.
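A small NumPy sketch of this canonical case (function names are mine): with the Shannon negentropy, Ω* is logsumexp, the link ∇Ω*(f) is the softmax, and the Fenchel-Young loss vanishes exactly when the link matches the target distribution.

```python
import numpy as np
from scipy.special import logsumexp, softmax

def shannon_negentropy(alpha, eps=1e-12):
    """Ω(α) = Σ_i α_i log α_i (non-positive on the simplex)."""
    return float(np.sum(alpha * np.log(alpha + eps)))

def fy_loss(alpha, f):
    """ℓ_Ω(α, f) = Ω(α) + Ω*(f) − ⟨α, f⟩ ≥ 0, with Ω* = logsumexp."""
    return shannon_negentropy(alpha) + logsumexp(f) - float(alpha @ f)

alpha = np.array([0.7, 0.2, 0.1])
print(fy_loss(alpha, np.log(alpha)))   # ≈ 0: softmax(log α) = α, so the loss vanishes
print(fy_loss(alpha, np.zeros(3)))     # > 0 for a score vector whose softmax is not α
print(softmax(np.log(alpha)))          # ∇Ω*(f), recovering α
```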

SLIDE 28

1. Introduction
2. Fenchel-Young losses for distribution spaces
3. Geometric softmax from Sinkhorn negentropies
4. Applications

SLIDES 29-31

Sinkhorn entropies from regularized optimal transport

Self-regularized optimal transport distance:

  Ω_C(α) = −½ OT_{C,ε=2}(α, α) = − max_{f ∈ C(Y)} ⟨α, f⟩ − log⟨α ⊗ α, exp((f ⊕ f − C)/2)⟩

Continuous and convex. Special cases:
- ε → ∞: MMD autocorrelation
- Specific costs C (e.g. +∞ off the diagonal) recover the Shannon entropy and the Gini index
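On a discrete Y, the maximization above reduces to a symmetric Sinkhorn fixed point; the following NumPy sketch follows that reading (function names and the averaged update are mine). At a fixed point f = T(f) the log term vanishes, so Ω_C(α) = −⟨α, f⟩.

```python
import numpy as np
from scipy.special import logsumexp

def sym_sinkhorn_potential(alpha, C, n_iter=200, eps=2.0):
    """Fixed point of T(f)_i = −ε log Σ_j α_j exp((f_j − C_ij)/ε), via averaged updates."""
    log_alpha = np.log(alpha)
    f = np.zeros_like(alpha)
    for _ in range(n_iter):
        T = -eps * logsumexp(log_alpha + (f - C) / eps, axis=1)
        f = 0.5 * (f + T)     # averaging stabilizes the symmetric iteration
    return f

def sinkhorn_negentropy(alpha, C):
    f = sym_sinkhorn_potential(alpha, C)
    return -float(alpha @ f)  # at the fixed point, Ω_C(α) = −⟨α, f⟩

# Sanity check on the Shannon special case: with zero self-cost and a huge
# off-diagonal cost, Ω_C(α) should match Σ_i α_i log α_i.
alpha = np.array([0.7, 0.2, 0.1])
C = 1e3 * (1 - np.eye(3))
print(sinkhorn_negentropy(alpha, C), np.sum(alpha * np.log(alpha)))
```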

SLIDE 32

Dual mapping from Sinkhorn negentropy

Sinkhorn entropy:

  Ω(α) = − max_{f ∈ C(Y)} ⟨α, f⟩ − log⟨α ⊗ α, e^{(f ⊕ f − C)/2}⟩

SLIDE 33

Returning to the primal: geometric softmax

  Ω* = g-logsumexp : f ↦ − log min_{α ∈ M₁⁺(Y)} ⟨α ⊗ α, exp(−(f ⊕ f + C)/2)⟩

∇Ω* = geometric-softmax: the inner problem is a simple quadratic in α, since ⟨α ⊗ α, K⟩ = αᵀKα.

SLIDE 34

Geometric loss construction and computation

Training with the geometric logistic loss:

  ℓ_C(α, f) = geometric-LSE_C(f) + sinkhorn-negentropy_C(α) − ⟨α, f⟩

- Tractable for discrete Y: mirror descent or L-BFGS, about 10× as costly as a softmax
- Backpropagation through ∇Ω* = geometric-softmax
- Continuous Y: Frank-Wolfe scheme, adding one Dirac at each iteration
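A minimal discrete-case sketch of the mirror-descent route (NumPy; names, step size, and iteration budget are mine): exponentiated-gradient steps minimize the simple quadratic α ↦ ⟨α ⊗ α, K⟩ = αᵀKα over the simplex, with K = exp(−(f ⊕ f + C)/2) as on the previous slide.

```python
import numpy as np

def geometric_softmax(f, C, n_iter=2000, lr=0.5):
    """∇Ω*(f): minimize αᵀKα over the simplex by exponentiated gradient."""
    K = np.exp(-(f[:, None] + f[None, :] + C) / 2.0)
    alpha = np.full_like(f, 1.0 / len(f))    # start from the uniform distribution
    for _ in range(n_iter):
        grad = 2.0 * K @ alpha               # gradient of the quadratic: 2Kα
        alpha = alpha * np.exp(-lr * grad)   # multiplicative (mirror) step
        alpha /= alpha.sum()                 # renormalize onto the simplex
    return alpha

# Shannon sanity check: a huge off-diagonal cost makes K effectively diagonal
# with entries e^{−f_i}, whose simplex minimizer is the ordinary softmax.
f = np.array([2.0, 1.0, 0.0])
C = 1e3 * (1 - np.eye(3))
print(geometric_softmax(f, C))
print(np.exp(f) / np.exp(f).sum())
```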


SLIDES 35-36

Properties of the geometric-softmax

[Figure: binary case. Left: the link coordinate ∇Ω*(f)₁ against f₁ − f₂, for costs scaled by γ. Right: the negentropy −Ω([α, 1 − α]) against α. Curves for γ = 0.1, γ = 2, γ = ∞.]

- ∇Ω* returns from Sinkhorn potentials: ∇Ω* ∘ ∇Ω = Id
- ∇Ω ∘ ∇Ω* projects f onto F, the set of symmetric Sinkhorn potentials
- ∇Ω*(f) can be sparse, a consequence of the minimization over the simplex
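The first identity can be checked numerically by composing the two sketches above, under the assumption (a reading of mine, consistent with the dual formulas on the preceding slides) that ∇Ω(α) is the negated symmetric Sinkhorn potential up to an additive constant, which the link ignores:

```python
import numpy as np  # reusing sym_sinkhorn_potential and geometric_softmax from above

alpha = np.array([0.5, 0.3, 0.2])
C = np.abs(np.arange(3)[:, None] - np.arange(3)[None, :]).astype(float)
f = -sym_sinkhorn_potential(alpha, C)  # a representative of ∇Ω(α)
print(geometric_softmax(f, C))         # ≈ α: ∇Ω* returns from the potential
```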

SLIDE 37

Properties of the geometric-softmax:
- ε → 0: mode finding
- ε → ∞: positive deconvolution

SLIDES 38-40

Consistent learning with the geometric logistic loss

Bregman divergence from the Sinkhorn negentropy:

  D(α | β) = Ω(α) − Ω(β) − ⟨∇Ω(β), α − β⟩

Sample distributions (x_i, α_i)_i ∈ X × M₁⁺(Y).

Fisher consistency:

  min_{β : X → M₁⁺(Y)} E[D(α | β(x))] = min_{g : X → C(Y)} E[ℓ_C(α, g(x))],

with the minimizers related through the link, β = ∇Ω* ∘ g.
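Numerically, this divergence can be assembled from the potential sketch above. With the fixed-point normalization, Ω(α) = −⟨α, g_α⟩ and ∇Ω(β) = −g_β up to an additive constant that cancels in ⟨∇Ω(β), α − β⟩, so the divergence collapses to ⟨α, g_β − g_α⟩ (a derivation of mine from the slides' formulas; in the Shannon special case it reduces to KL(α‖β)):

```python
import numpy as np  # reusing sym_sinkhorn_potential from above

def sinkhorn_bregman(alpha, beta, C):
    """D(α|β) = Ω(α) − Ω(β) − ⟨∇Ω(β), α − β⟩, reduced to ⟨α, g_β − g_α⟩."""
    g_a = sym_sinkhorn_potential(alpha, C)
    g_b = sym_sinkhorn_potential(beta, C)
    return float(alpha @ (g_b - g_a))

alpha, beta = np.array([0.7, 0.2, 0.1]), np.array([0.3, 0.4, 0.3])
C = np.abs(np.arange(3)[:, None] - np.arange(3)[None, :]).astype(float)
print(sinkhorn_bregman(alpha, beta, C))   # > 0 for α ≠ β
print(sinkhorn_bregman(alpha, alpha, C))  # ≈ 0
```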

SLIDE 41

1. Introduction
2. Fenchel-Young losses for distribution spaces
3. Geometric softmax from Sinkhorn negentropies
4. Applications


SLIDES 42-43

Applications: variational auto-encoder

- Goal: generate nearly one-dimensional distributions in 2D images
- Dataset: Google Quickdraw
- The traditional sigmoid activation layer is replaced by the geometric softmax
- Deconvolutional effect; cost-informed non-linearity

SLIDE 44

Applications: variational auto-encoders

[Figure: reconstructions and generated samples, softmax vs. g-softmax]

Better-defined generated images


SLIDES 45-46

Conclusion

Geometric softmax: a new loss and projection onto output probabilities
- Discrete or continuous outputs, aware of a cost between outputs
- Fenchel duality in Banach spaces + regularized optimal transport
- Applications in VAEs and ordinal regression

Future directions:
- Improving computation methods (continuous Frank-Wolfe)
- Geometric logistic loss in super-resolution [6]

[6] Nicholas Boyd et al. "DeepLoco: Fast 3D localization microscopy using neural networks". In: bioRxiv (2018), p. 267096.

SLIDE 47

Mathieu Blondel, Gabriel Peyré

Poster #179