

SLIDE 1

Efficient Bregman Projections Onto the Simplex

Walid Krichene, Syrine Krichene, Alexandre Bayen
Electrical Engineering and Computer Sciences, UC Berkeley; ENSIMAG and Criteo Labs, France

December 16, 2015

SLIDE 2

Outline

1. Introduction
2. Projection Algorithms
3. Numerical experiments


SLIDES 4–5

Bregman Projections onto the simplex

Bregman projections are the building block of mirror descent (Nemirovski and Yudin) and dual averaging (Nesterov). They arise in:

Convex optimization: $\min_{x \in \mathcal{X}} f(x)$.
Online learning (regret minimization).

Algorithm 2: Mirror descent method
1: for $\tau \in \mathbb{N}$ do
2:   Query a subgradient vector $g^{(\tau)} \in \partial f(x^{(\tau)})$ (or loss vector)
3:   Update
     $x^{(\tau+1)} = \arg\min_{x \in \mathcal{X}} D_\psi\left(x, (\nabla\psi)^{-1}\left(\nabla\psi(x^{(\tau)}) - \eta_\tau g^{(\tau)}\right)\right)$   (1)

$\psi$: strongly convex distance-generating function. $D_\psi$: Bregman divergence.
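To make the update concrete, here is a minimal sketch of one mirror descent step, assuming the entropic distance-generating function $\psi(x) = \sum_i x_i \ln x_i$, for which update (1) reduces to the well-known exponentiated-gradient closed form (the $\epsilon = 0$ special case treated later in the deck):

```python
import numpy as np

# Minimal sketch: one mirror descent step on the simplex with the entropic
# DGF psi(x) = sum_i x_i ln x_i. For this choice, update (1) has the closed
# form x_i <- x_i * exp(-eta * g_i) / Z (exponentiated gradient).
def entropic_md_step(x, g, eta):
    y = x * np.exp(-eta * g)   # gradient step taken in the dual space
    return y / y.sum()         # Bregman (KL) projection back onto the simplex

x = np.ones(4) / 4                     # start at the uniform distribution
g = np.array([0.5, -0.2, 0.1, 0.0])    # a subgradient / loss vector
x = entropic_md_step(x, g, eta=0.1)
assert abs(x.sum() - 1.0) < 1e-12
```

For general potentials the projection has no such closed form; computing it efficiently is exactly what the projection algorithms in the next section address.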

SLIDE 6

Illustration of Bregman projections

Figure: Illustration of a mirror descent iteration. $\nabla\psi$ maps $x^{(\tau)} \in E$ to the dual space $E^*$, the gradient step $-\eta_\tau g^{(\tau)}$ is taken there, and $(\nabla\psi)^{-1}$ maps the result back to $E$, where it is projected onto $\mathcal{X}$ to give $x^{(\tau+1)}$:

$x^{(\tau+1)} = \arg\min_{x \in \mathcal{X}} D_\psi\left(x, (\nabla\psi)^{-1}\left(\nabla\psi(x^{(\tau)}) - \eta_\tau g^{(\tau)}\right)\right)$

SLIDES 7–8

More precisely

The feasible set is the simplex (or a Cartesian product of simplexes):
$\Delta = \{ x \in \mathbb{R}^d_+ : \sum_i x_i = 1 \}$
Motivation: online learning, optimization over probability distributions.

The distance-generating function is induced by a potential:
$\psi(x) = \sum_i f(x_i)$, with $f(x) = \int_1^x \phi^{-1}(u)\,du$, where $\phi$, an increasing function, is called the potential.

Consequence: known expressions for $\nabla\psi$ and $(\nabla\psi)^{-1}$; both act coordinatewise, with $(\nabla\psi(x))_i = \phi^{-1}(x_i)$ and $((\nabla\psi)^{-1}(y))_i = \phi(y_i)$.
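As a small illustration of this last point (a sketch; the entropic potential is our choice of example), the mirror maps can be written in a couple of lines:

```python
import numpy as np

# Potential-induced DGF: psi(x) = sum_i f(x_i) with f' = phi^{-1}, so the
# mirror maps act coordinatewise. Example with the entropic potential
# phi(u) = exp(u - 1), for which f(x) = x ln x (negative entropy).
phi = lambda u: np.exp(u - 1.0)       # ((grad psi)^{-1}(y))_i = phi(y_i)
phi_inv = lambda x: 1.0 + np.log(x)   # (grad psi(x))_i = phi^{-1}(x_i)

grad_psi = phi_inv
grad_psi_inv = phi
```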

SLIDE 9

Outline

1. Introduction
2. Projection Algorithms
3. Numerical experiments

SLIDE 10

Projection algorithms

General strategy:
1. Derive optimality conditions.
2. Design an algorithm that satisfies those conditions.

SLIDES 11–12

Optimality conditions

$x^\star = \arg\min_{x \in \mathcal{X}} D_\psi\left(x, (\nabla\psi)^{-1}(\nabla\psi(\bar{x}) - \bar{g})\right)$

Optimality conditions: $x^\star$ is optimal if and only if there exists $\nu^\star \in \mathbb{R}$ such that
$\forall i,\ x^\star_i = \left(\phi(\phi^{-1}(\bar{x}_i) - \bar{g}_i + \nu^\star)\right)_+$ and $\sum_{i=1}^d x^\star_i = 1$.

Proof: write the KKT conditions, then eliminate complementary slackness.

Comments:
This reduces a problem in dimension d to a problem in dimension 1.
The function $c : \nu \mapsto \sum_i \left(\phi(\phi^{-1}(\bar{x}_i) - \bar{g}_i + \nu)\right)_+$ is increasing.
One can therefore solve for $\nu^\star$ using bisection.

SLIDE 13

Bisection algorithm for general divergences

Algorithm 3: Bisection method to compute the projection $x^\star$ with precision $\epsilon$.
1: Input: $\bar{x}$, $\bar{g}$, $\epsilon$.
2: Initialize $\bar{\nu} = \phi^{-1}(1) - \max_i\left(\phi^{-1}(\bar{x}_i) - \bar{g}_i\right)$ and $\underline{\nu} = \phi^{-1}(1/d) - \max_i\left(\phi^{-1}(\bar{x}_i) - \bar{g}_i\right)$
3: while $c(\bar{\nu}) - c(\underline{\nu}) > \epsilon$ do
4:   Let $\nu_+ \leftarrow \frac{\bar{\nu} + \underline{\nu}}{2}$
5:   if $c(\nu_+) > 1$ then
6:     $\bar{\nu} \leftarrow \nu_+$
7:   else
8:     $\underline{\nu} \leftarrow \nu_+$
9: Return $\tilde{x}(\bar{\nu})$, with $\tilde{x}_i(\bar{\nu}) = \left(\phi(\phi^{-1}(\bar{x}_i) - \bar{g}_i + \bar{\nu})\right)_+$

Theorem: The algorithm terminates after $O(\ln\frac{1}{\epsilon})$ iterations, and outputs $\tilde{x}$ such that $\|\tilde{x}(\bar{\nu}) - x^\star\| \leq \epsilon$.
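A minimal Python sketch of Algorithm 3 follows, instantiated (our assumption, for concreteness) with the entropic potential $\phi(u) = e^{u-1}$; any increasing potential with a known inverse can be substituted:

```python
import numpy as np

# Sketch of Algorithm 3 (bisection), instantiated with the entropic potential
# phi(u) = exp(u - 1), phi^{-1}(x) = 1 + ln(x). Any increasing potential with
# a known inverse could be swapped in.
phi = lambda u: np.exp(u - 1.0)
phi_inv = lambda x: 1.0 + np.log(x)

def bisection_projection(x_bar, g_bar, eps=1e-10):
    z = phi_inv(x_bar) - g_bar                         # dual-space point
    c = lambda nu: np.maximum(phi(z + nu), 0.0).sum()  # increasing in nu
    # Initial bracket: c(nu_lo) <= 1 <= c(nu_hi) by construction (step 2).
    nu_hi = phi_inv(1.0) - z.max()
    nu_lo = phi_inv(1.0 / len(x_bar)) - z.max()
    while c(nu_hi) - c(nu_lo) > eps:
        nu_mid = 0.5 * (nu_hi + nu_lo)
        if c(nu_mid) > 1:
            nu_hi = nu_mid
        else:
            nu_lo = nu_mid
    return np.maximum(phi(z + nu_hi), 0.0)

x = bisection_projection(np.ones(4) / 4, np.array([0.5, -0.2, 0.1, 0.0]))
print(x, x.sum())  # sums to 1 up to the chosen precision
```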

SLIDES 14–17

Exact projections for exponential divergences

Special case 1: $\psi(x) = \|x\|^2$: the solution can be computed exactly [1].

Special case 2: exponential divergence, given by the potential
$\phi_\epsilon : (-\infty, +\infty) \to (-\epsilon, +\infty)$, $u \mapsto e^{u-1} - \epsilon$.

For $\epsilon = 0$: $\psi(x) = H(x) = \sum_i x_i \ln x_i$ (negative entropy), and $D_\psi(x, y) = D_{KL}(x, y)$.
For $\epsilon > 0$: $\psi(x) = H(x + \epsilon)$, and $D_\psi(x, y) = D_{KL}(x + \epsilon, y + \epsilon)$.

[1] J. Duchi, S. Shalev-Shwartz, Y. Singer, T. Chandra, Efficient Projections onto the $\ell_1$ Ball for Learning in High Dimensions, ICML 2008.

SLIDES 18–19

Motivation

The Bregman projection with the KL divergence underlies:
the Hedge algorithm in online learning,
the multiplicative weights algorithm,
exponentiated gradient descent.
It has a closed-form solution computable in O(d).

However:
$D_{KL}(x, y)$ is unbounded on the simplex (problematic for stochastic mirror descent).
$H(x)$ is not a smooth function (problematic for accelerated mirror descent).
Taking $\epsilon > 0$ solves both issues (a numerical sketch follows the figure).

Figure: $D_{KL}(x, y_0)$ versus the smoothed $D_{KL,\epsilon}(x, y_0)$, the latter sandwiched between $\frac{\ell_\epsilon}{2}\|x - y_0\|_1^2$ and $\frac{L_\epsilon}{2}\|x - y_0\|_1^2$.
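A small numerical illustration of the boundedness point (a sketch; the example vectors are ours):

```python
import numpy as np

# D_{KL,eps}(x, y) = D_KL(x + eps, y + eps): shifting both arguments keeps
# every coordinate bounded away from 0, so the divergence stays bounded on
# the simplex even as y approaches a vertex.
def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def kl_eps(x, y, eps):
    return kl(x + eps, y + eps)

x = np.ones(3) / 3
y = np.array([0.9999, 5e-5, 5e-5])   # y near a vertex of the simplex
print(kl(x, y))                       # grows without bound as y -> vertex
print(kl_eps(x, y, 0.1))              # stays bounded thanks to the shift
```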

SLIDE 20

Optimality conditions

Recall the general optimality condition: $x^\star_i = \left(\phi(\phi^{-1}(\bar{x}_i) - \bar{g}_i + \nu^\star)\right)_+$.

Optimality conditions with exponential divergence: let $x^\star$ be the solution and $I = \{i : x^\star_i > 0\}$ its support. Then

$\forall i \in I,\ x^\star_i = -\epsilon + \frac{(\bar{x}_i + \epsilon)e^{-\bar{g}_i}}{Z^\star}$, where $Z^\star = \frac{\sum_{i \in I}(\bar{x}_i + \epsilon)e^{-\bar{g}_i}}{1 + |I|\epsilon}$.   (2)

Furthermore, if $\bar{y}_i = (\bar{x}_i + \epsilon)e^{-\bar{g}_i}$, then
$(i \in I \text{ and } \bar{y}_j > \bar{y}_i) \Rightarrow j \in I$,
i.e. the support consists of the largest entries of $\bar{y}$.

SLIDE 21

A sorting-based algorithm

Algorithm 4: Sorting method to compute the Bregman projection with $D_{\psi_\epsilon}$
1: Input: $\bar{x}$, $\bar{g}$
2: Output: $x^\star$
3: Form the vector $\bar{y}_i = (\bar{x}_i + \epsilon)e^{-\bar{g}_i}$
4: Sort $\bar{y}$; let $\bar{y}_{(i)}$ be the i-th smallest element of $\bar{y}$.
5: Let $j^\star$ be the smallest index $j$ for which $(1 + \epsilon(d - j + 1))\,\bar{y}_{(j)} - \epsilon \sum_{i \geq j} \bar{y}_{(i)} > 0$
6: Set $Z = \frac{\sum_{i \geq j^\star} \bar{y}_{(i)}}{1 + \epsilon(d - j^\star + 1)}$
7: Set $x^\star_i = \left(-\epsilon + \frac{\bar{y}_i}{Z}\right)_+$

Complexity: O(d ln d)
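A direct Python transcription of Algorithm 4 (a sketch with our own variable names; the reference implementation lives at github.com/walidk):

```python
import numpy as np

# Sketch of Algorithm 4 (SortProjection), 0-indexed.
def sort_projection(x_bar, g_bar, eps):
    d = len(x_bar)
    y = (x_bar + eps) * np.exp(-g_bar)             # step 3
    y_sorted = np.sort(y)                          # step 4, ascending
    suffix = np.cumsum(y_sorted[::-1])[::-1]       # suffix[j] = sum_{i >= j} y_(i)
    sizes = d - np.arange(d)                       # |{j, ..., d-1}|
    # Step 5: smallest j with (1 + eps*sizes[j]) * y_(j) - eps * suffix[j] > 0.
    # Such a j always exists: at j = d-1 the condition reads y_(d-1) > 0.
    j_star = int(np.argmax((1 + eps * sizes) * y_sorted - eps * suffix > 0))
    Z = suffix[j_star] / (1 + eps * sizes[j_star]) # step 6
    return np.maximum(-eps + y / Z, 0.0)           # step 7

x = sort_projection(np.ones(4) / 4, np.array([0.5, -0.2, 0.1, 0.0]), eps=0.05)
print(x, x.sum())  # a point on the simplex: nonnegative, sums to 1
```

For $\epsilon = 0$ the condition in step 5 already holds at $j = 1$, and the output reduces to the closed-form multiplicative-weights update $\bar{y} / \sum_i \bar{y}_i$.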

SLIDE 22

A randomized-pivot algorithm

Adapted from the QuickSelect algorithm for selecting the i-th smallest element of a vector $\bar{y}$. One can sort and then return the i-th element in O(d ln d); QuickSelect instead runs in expected O(d) time (worst case O(d^2)). The animation below walks through an example run; a code sketch follows it.

SLIDES 23–32

A randomized-pivot algorithm

(Animation: QuickSelect run on the array 9 1 4 8 7 2 3 5 6 with k = 5. Each step partitions around a pivot, discards the side that cannot contain the k-th smallest element, and updates k accordingly: first k = 5 on the full array, then k = 3 on the sub-array 9 4 8 7 3 5 6, then k = 3 on 4 3 5, ending at the answer, 5.)
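The QuickProject algorithm itself is not reproduced on the slides; below is a sketch of the underlying QuickSelect primitive it adapts (our understanding is that QuickProject additionally maintains partial sums of the discarded side, so that $j^\star$ and $Z$ from Algorithm 4 are recovered without ever sorting the whole vector):

```python
import random

# Randomized QuickSelect: k-th smallest element (k is 1-indexed), expected
# O(d) time, worst case O(d^2).
def quickselect(a, k):
    pivot = random.choice(a)
    lo = [v for v in a if v < pivot]   # elements smaller than the pivot
    eq = [v for v in a if v == pivot]
    hi = [v for v in a if v > pivot]
    if k <= len(lo):
        return quickselect(lo, k)      # answer lies on the small side
    if k <= len(lo) + len(eq):
        return pivot                   # the pivot is the k-th element
    return quickselect(hi, k - len(lo) - len(eq))  # discard left, shift k

print(quickselect([9, 1, 4, 8, 7, 2, 3, 5, 6], 5))  # 5, as in the animation
```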

SLIDE 33

Outline

1. Introduction
2. Projection Algorithms
3. Numerical experiments

SLIDE 34

Scaling of the SortProject and QuickProject algorithms

Figure: Average run time (s) of SortProjection and QuickProjection as a function of the problem dimension d (left panel: log-log scale, d from 10^2 to 10^7; right panel: linear scale, d up to 3·10^6).

SLIDE 35

Accelerated entropic descent with and without smoothing

Figure: Entropic descent, with and without smoothing [2]. (Shown as a video in the original presentation.)

[2] W. Krichene, A. Bayen, P. Bartlett, Accelerated Mirror Descent in Continuous and Discrete Time, NIPS 2015.

SLIDES 36–37

Summary

Bregman projection        Method             Complexity
General divergence        Bisection          O(ln(1/ε))
Exponential divergence    SortProjection     O(d ln d)
Exponential divergence    QuickProjection    O(d) in expectation

Used for:
Convex optimization on the simplex.
Online learning.
Accelerated entropic descent.

Code implementation: github.com/walidk

Thank you!
eecs.berkeley.edu/~walid/

SLIDE 38 (backup)

Accelerated entropic descent with and without smoothing

Figure: Entropic descent, with and without smoothing.