
On Compiling (Online) Combinatorial Learning Problems

Frédéric Koriche

CRIL - CNRS UMR 8188, Univ. Artois
koriche@cril.fr
Dagstuhl’17, New Trends in Knowledge Compilation

1


Outline

1 Online Learning
2 The Convex Case
3 The Combinatorial Case
4 Compiling Hedge

2


Online Learning

Online learning is a zero-sum repeated game between a learning algorithm and its environment. The components of the game are:
• A class H of hypotheses (the learner’s moves)
• A space Z of instances (the environment’s moves)
• A loss function ℓ : H × Z → R (the game “matrix”)

3

Online Learning

[Figure: the repeated game, round by round. Learner: h1, h2, h3, …, hT; Environment: z1, z2, z3, …, zT; cumulative loss ℓ(h1, z1) + ℓ(h2, z2) + ℓ(h3, z3) + … + ℓ(hT, zT)]

During each round t of the game:
• The learner plays a hypothesis ht ∈ H
• Simultaneously, the environment plays an instance zt ∈ Z
• Then zt is revealed to the learner, which incurs the loss ℓ(ht, zt)

The goal for the learner is to minimize its cumulative loss over the course of the game.

4
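
To make the protocol concrete, here is a minimal Python sketch of the game loop. The learner interface (predict/update) and the environment iterator are our own illustrative assumptions, not notation from the slides.

```python
# A minimal sketch of the online learning protocol described above.
# `learner` and `environment` are assumed objects (our own interface):
# the learner exposes predict() and update(z); the environment yields instances.
def play(learner, environment, loss, T):
    """Run T rounds of the repeated game and return the cumulative loss."""
    total = 0.0
    for t in range(T):
        h_t = learner.predict()    # the learner plays h_t in H
        z_t = next(environment)    # simultaneously, the environment plays z_t in Z
        total += loss(h_t, z_t)    # z_t is revealed and the loss is incurred
        learner.update(z_t)        # the learner may adapt before round t + 1
    return total
```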

Online Learning

[Figure: a separating hyperplane ht and a labeled example zt]

Example: Online Linear Classification. On each round t:
• The learner plays a separating hyperplane ht = sign(⟨wt, ·⟩)
• Simultaneously, the environment plays a labeled example zt = (xt, yt)
• Then, the learner incurs the hinge loss ℓ(ht, zt) = max(0, 1 − yt⟨wt, xt⟩)

5
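
As a sketch of this example, the hypothesis and its hinge loss can be written directly; NumPy is our choice here, `w` plays the role of wt, and labels are taken in {−1, +1}.

```python
import numpy as np

def predict(w, x):
    """The separating hyperplane h_t = sign<w, .> applied to x."""
    return np.sign(np.dot(w, x))

def hinge_loss(w, x, y):
    """l(h_t, z_t) = max(0, 1 - y <w, x>) for a labeled example z = (x, y)."""
    return max(0.0, 1.0 - y * np.dot(w, x))
```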

Online Learning

[Figure: a probability distribution ht over the instances z1, …, z8]

Example: Online Density Estimation. On each round t:
• The learner plays a probability distribution ht over Z
• Simultaneously, the environment plays an instance zt
• Then, the learner incurs the log loss ℓ(ht, zt) = − ln ht(zt)

6
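
A corresponding sketch for this example, with the distribution stored as a dictionary from instances to probabilities (a finite Z is our simplifying assumption):

```python
import math

def log_loss(h, z):
    """l(h_t, z_t) = -ln h_t(z_t), for h_t given as a dict of probabilities."""
    return -math.log(h[z])

h_t = {"z1": 0.5, "z2": 0.25, "z3": 0.25}   # a toy distribution over three instances
print(log_loss(h_t, "z2"))                  # -ln 0.25, roughly 1.386
```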

Online Learning

[Figure: the repeated game, with the environment’s moves drawn from a fixed distribution D in the statistical setting, or from changing distributions D1, D2, D3, …, DT in the adversarial setting]

Online learning can be applied to a wide range of tasks, ranging from statistical learning, where the environment is an oblivious player modelled by a fixed probability distribution D over Z, to adversarial learning, where the environment is an active player who changes its distribution at each iteration in response to the learner’s moves.

7


Online Learning

In a nutshell, online learning is particularly suited to:
• Adaptive environments, where the data distribution can change over time
• Streaming applications, where all the data is not available in advance
• Large-scale datasets, by processing only one instance at a time

8

Online Learning

The performance of an online learning algorithm A is measured according to two metrics:

Minimax Regret
Defined by the maximum, over every sequence z1:T = (z1, · · · , zT) ∈ Z^T, of the cumulative relative loss between A and the best hypothesis in H:

max_{z1:T ∈ Z^T} [ Σ_{t=1}^{T} ℓ(ht, zt) − min_{h∈H} Σ_{t=1}^{T} ℓ(h, zt) ]

A is Hannan consistent if its minimax regret is sublinear in T.

Per-round Complexity
Given by the amount of computational operations spent by A at each trial t, for choosing a hypothesis ht in H, and evaluating its loss ℓ(ht, zt).

9
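
For a finite hypothesis class, the regret term inside the maximum can be computed directly. The sketch below is a naive implementation; enumerating H explicitly is our simplification, feasible only for small classes.

```python
# Cumulative loss of the plays h_1..h_T minus that of the best fixed h in H,
# i.e. the bracketed quantity in the minimax regret above, for one sequence.
def regret(plays, instances, H, loss):
    alg_loss = sum(loss(h_t, z_t) for h_t, z_t in zip(plays, instances))
    best_loss = min(sum(loss(h, z_t) for z_t in instances) for h in H)
    return alg_loss - best_loss
```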


Outline

1 Online Learning
2 The Convex Case
3 The Combinatorial Case
4 Compiling Hedge

10


The Convex Case

[Figure: the hinge loss ℓ(ht, zt) = max(0, 1 − yt⟨wt, xt⟩) (in blue), plotted against the margin yt⟨wt, xt⟩, for the hypothesis class H = {w ∈ R^n : ‖w‖ ≤ b}]

Online Convex Learning
An online learning problem (H, Z, ℓ) is convex if:
• H is a closed convex subset of R^d
• ℓ is convex in its first argument, i.e. ℓ(·, z) is convex for all z ∈ Z

11

The Convex Case

Online Gradient Descent
Start with the vector w1 = 0. During each round t:
• Play with wt
• Observe zt and incur loss ℓ(wt, zt)
• Update the hypothesis as follows:

w_{t+1} = argmin_{w∈H} ‖w − w′_t‖²   where   w′_t = wt − ηt ∇ℓ(wt, zt)

The regret of OGD with respect to any w∗ ∈ H is bounded by

‖w∗‖² / ηT + (1/2) Σ_{t=1}^{T} ηt ‖∇ℓ(wt, zt)‖²

Thus, if ℓ is L-Lipschitz and H is D-bounded, then using ηt = D/(L√t), OGD is Hannan consistent with regret 2DL√T.

12
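
A minimal sketch of OGD for the hinge-loss example above, assuming H is the Euclidean ball of radius b (so the projection step is just a rescaling) and using the slide’s step sizes ηt = D/(L√t). The function signature and the choice of loss are ours, for illustration.

```python
import numpy as np

def ogd(examples, dim, b, D, L):
    """Online Gradient Descent on the hinge loss over H = {w : ||w|| <= b}."""
    w = np.zeros(dim)                                   # w_1 = 0
    for t, (x, y) in enumerate(examples, start=1):
        yield w, max(0.0, 1.0 - y * np.dot(w, x))       # play w_t, incur l(w_t, z_t)
        # a subgradient of the hinge loss at w_t
        grad = -y * x if y * np.dot(w, x) < 1.0 else np.zeros(dim)
        w = w - (D / (L * np.sqrt(t))) * grad           # w'_t = w_t - eta_t * grad
        norm = np.linalg.norm(w)
        if norm > b:                                    # project w'_t back onto H
            w = w * (b / norm)
```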


Outline

1 Online Learning
2 The Convex Case
3 The Combinatorial Case
4 Compiling Hedge

13

The Combinatorial Case

In various situations, the hypothesis class is not a convex set, but involves combinatorial aspects which make the learning problem much more difficult to solve.

Example: Learning Boolean Functions
• H is a class of Boolean functions (monomials, clauses, k-DNF, k-term DNF, etc.)
• ℓ is the zero-one loss function (which counts the number of misclassifications)

14


The Combinatorial Case

[Figure: the lattice of monotone monomials over x1, …, x4, from ∅ up to x1x2x3x4, highlighting the empirical minimizer h∗ = argmin_{h∈H} Σ_{i=1}^{t} ℓ(h, zi)]

Let H be the class of monotone monomials over {0, 1}^d. Then, finding a hypothesis h ∈ H that minimizes the cumulative zero-one loss over an arbitrary set (z1, · · · , zt) of labeled instances in {0, 1}^d × {0, 1} is NP-hard (Kearns et al., 1994).

15
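
The offline problem on this slide can of course be solved by brute force, but only by enumerating all 2^d monotone monomials, which is exactly the blow-up the NP-hardness result formalizes. A naive sketch (representation choices are ours):

```python
from itertools import combinations

def best_monomial(data, d):
    """argmin over monotone monomials of the cumulative zero-one loss.

    data: list of (x, y) pairs with x a 0/1 tuple of length d and y in {0, 1}.
    A monomial is a set of variable indices; it predicts 1 iff all are set in x.
    """
    def cumulative_loss(m):
        return sum(int(all(x[i] for i in m) != bool(y)) for x, y in data)
    all_monomials = (set(m) for k in range(d + 1)
                     for m in combinations(range(d), k))
    return min(all_monomials, key=cumulative_loss)   # 2^d candidates in total
```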

The Combinatorial Case

Hedge
Start from the uniform distribution p1 over H. On each round t:
• Play ht at random according to pt
• Observe zt and incur loss ℓ(ht, zt)
• Update the distribution as follows:

p_{t+1}(h) = (1/Zt) pt(h) e^{−ηt ℓ(h, zt)}   where   Zt = Σ_{h′∈H} pt(h′) e^{−ηt ℓ(h′, zt)}

The regret of Hedge with respect to any h∗ ∈ H is bounded by

ln |H| / ηT + (1/2) Σ_{t=1}^{T} ηt ℓ(ht, zt)²

Thus, if ℓ is B-bounded, then using ηt = (1/B)√(ln |H| / t), Hedge is Hannan consistent with regret 2B√(T ln |H|).

16
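
A minimal sketch of Hedge with the class H enumerated explicitly (exactly the exponential-storage issue revisited in the last part of the talk), using the slide’s tuning ηt = (1/B)√(ln |H| / t). The interface is our own illustration.

```python
import numpy as np

def hedge(H, instances, loss, B, seed=0):
    """Hedge over an explicit finite class H of hypotheses."""
    rng = np.random.default_rng(seed)
    p = np.full(len(H), 1.0 / len(H))                 # p_1 is uniform over H
    for t, z in enumerate(instances, start=1):
        i = rng.choice(len(H), p=p)                   # play h_t ~ p_t
        yield H[i], loss(H[i], z)                     # incur l(h_t, z_t)
        eta = np.sqrt(np.log(len(H)) / t) / B
        losses = np.array([loss(h, z) for h in H])
        p = p * np.exp(-eta * losses)                 # multiplicative update
        p = p / p.sum()                               # renormalize by Z_t
```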


The Combinatorial Case

In various situations, the hypothesis class is not a convex set, but involves combinatorial aspects which make the learning problem much more difficult to solve.

Example: Learning Boolean Functions
• H is a class of Boolean functions (monomials, clauses, k-DNF, k-term DNF, etc.)
• ℓ is the zero-one loss function (which counts the number of misclassifications)

Example: Learning Probabilistic Graphical Models
• H is a class of probability distributions, each represented by a graphical structure G and a set θ of parameters.
• ℓ is the log loss (which estimates the log-likelihood of the model)

17


The Combinatorial Case

[Figure: a triangulated graph G with parameters Θ, and the empirical minimizer argmin_{h∈H} Σ_{i=1}^{t} ℓ(h, zi)]

Let G be the class of triangulated graphs over {1, · · · , d} and Θ be the class of marginal distributions over 3-cliques. Then, finding a Markov net (G, θ) that minimizes the cumulative log loss over an arbitrary set (z1, · · · , zt) of instances in {0, 1}^d is NP-hard (Srebro, 2003).

18

The Combinatorial Case

Hedge + OGD
Start from the uniform distribution p1 over G, and the parameter vector θ1 = 0. On each round t:
• Play ht = (Gt, θt), where Gt ∼ pt
• Observe zt and incur loss ℓ(ht, zt)
• Update the distribution pt with the Hedge rule
• Update the parameters θt with the OGD rule

If ℓ is L-Lipschitz with respect to Θ and B-bounded with respect to G, then the Hedge + OGD strategy is Hannan consistent with regret in O(dBL√(T ln |G|)).

19
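
A structural sketch of the combined strategy. Here `loss` and `grad_loss` (the gradient in θ) are assumed problem-specific callables, the step sizes are left as parameters rather than the tuned schedules, and the projection onto Θ is omitted.

```python
import numpy as np

def hedge_plus_ogd(G, instances, loss, grad_loss, eta_hedge, eta_ogd, dim, seed=0):
    """Hedge over the structures in G combined with OGD on the parameters."""
    rng = np.random.default_rng(seed)
    p = np.full(len(G), 1.0 / len(G))          # p_1 is uniform over G
    theta = np.zeros(dim)                      # theta_1 = 0
    for z in instances:
        g = G[rng.choice(len(G), p=p)]         # play h_t = (G_t, theta_t), G_t ~ p_t
        yield (g, theta), loss(g, theta, z)
        losses = np.array([loss(gi, theta, z) for gi in G])
        p = p * np.exp(-eta_hedge * losses)    # Hedge rule on the structure
        p = p / p.sum()
        theta = theta - eta_ogd * grad_loss(g, theta, z)   # OGD rule on the parameters
```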


Outline

1 Online Learning
2 The Convex Case
3 The Combinatorial Case
4 Compiling Hedge

20

Compiling Hedge

[Figure: the lattice of monotone monomials over x1, …, x4, from ∅ up to x1x2x3x4]

Hedge maintains a high-dimensional probability distribution. Even for the simple class H of monotone monomials, we have to store |H| = 2^d weights.

Yet, if we consider H as a set of feasible solutions, each probability distribution pt can be represented in a declarative and compact way, using weighted constraints.

21

Compiling Hedge

Suppose we start from the uniform distribution p1 over the 2^d monomials. Here p1 can be trivially represented by the universal constraint C0 over the Boolean variables {x1, · · · , xd}:

C0:  ⊤ : 1

[Figure: every monomial in the lattice carries weight 1]

After receiving the labeled instance z1 = ((0, 0, 1, 1), 1), the distribution p2 can be represented by the weighted Boolean constraints {C0, C1}:

C1:  x1 ∨ x2 : 1/e^{η1}

[Figure: every monomial containing x1 or x2 now carries weight 1/e^{η1}; the others keep weight 1]

And, after receiving z2 = ((1, 1, 0, 0), 0), the distribution p3 can be described by the weighted Boolean constraints {C0, C1, C2}:

C2:  ¬x3 ∧ ¬x4 : 1/e^{η2}

[Figure: the monomials containing neither x3 nor x4 (∅, x1, x2, x1x2) carry the extra factor 1/e^{η2}, ending with weights 1/e^{η2} on ∅ and 1/e^{η1+η2} on x1, x2, x1x2]

22
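
A toy sketch of what this compiled representation buys: the (unnormalized) Hedge weight of any single monomial can be read off from the constraint network without ever materializing the 2^d weights. The encoding below, a monomial as a set of variable indices and a constraint as a predicate paired with its ηt, is our own illustration.

```python
import math

def weight(m, constraints):
    """Unnormalized Hedge weight of monomial m under the compiled network."""
    w = 1.0
    for pred, eta in constraints:
        if pred(m):                  # monomials satisfying C_t were penalized at round t
            w *= math.exp(-eta)
    return w

eta1, eta2 = 0.5, 0.5                                 # illustrative step sizes
C1 = (lambda m: 1 in m or 2 in m, eta1)               # x1 or x2          (from z1)
C2 = (lambda m: 3 not in m and 4 not in m, eta2)      # neither x3 nor x4 (from z2)
print(weight({1, 2}, [C1, C2]))                       # x1x2 -> e^(-(eta1+eta2))
print(weight({3, 4}, [C1, C2]))                       # x3x4 -> 1
```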


Compiling Hedge

[Figure: a triangulated graph G encoded as the bit vector (0, 0, 1, 0, 0, · · · , 0, 1, 1, 0, 1)]

For the class of triangulated graphical models, G can be viewed as a subset of {0, 1}^n, where n = (d choose 3).

23
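
As a quick check of the encoding size (the value of d is ours, for illustration):

```python
from math import comb

d = 10
n = comb(d, 3)   # one Boolean indicator per candidate 3-clique
print(n)         # 120, so a graph is a point in {0, 1}^120 for d = 10
```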

Compiling Hedge

The initial distribution p1 can be encoded by a constraint (set) C0 restricting feasible solutions to triangulated graphs:

C0:  g is triangulated : 1

[Figure: every graph g in the class carries weight 1 under C0]

For the log loss, each incoming instance zt gives rise to a set of n weighted unary constraints Ct:

Ct:  g1 = 1 : θt1, · · · , gn = 1 : θtn

[Figure: the candidate structures g1, …, g36 paired with the current parameters, filtered by C0 and reweighted by Ct]

24

Compiling Hedge

Open Question 1: Which combinatorial hypothesis classes H enjoy the following property?
• The probability distribution pt over H maintained by Hedge (or its variants) can be encoded into a weighted constraint network N of size polynomial in d.

Open Question 2: Which target compilation languages L for N enjoy the following properties?
• The Random Generation (RG) query “draw a model at random according to the representation of N in L” is efficient.
• The Conjunctive Bounded Closure (∧BC) transformation “map a representation N of L and a weighted constraint Ct to a new representation in L” is efficient.

25