SLIDE 1
On Compiling (Online) Combinatorial Learning Problems
Frédéric Koriche
CRIL - CNRS UMR 8188, Univ. Artois
koriche@cril.fr
Dagstuhl'17, New Trends in Knowledge Compilation
SLIDE 2
Outline
1 Online Learning
2 The Convex Case
3 The Combinatorial Case
4 Compiling Hedge
SLIDE 3
Online Learning
Online learning is a zero-sum repeated game between a learning algorithm and its environment. The components of the game are:
A class H of hypotheses (the learner’s moves)
A space Z of instances (the environment’s moves)
A loss function ℓ : H × Z → ℝ (the game “matrix”)
SLIDE 13
Online Learning
[Diagram: the learner plays h1, h2, h3, …, hT; the environment plays z1, z2, z3, …, zT; the losses ℓ(h1, z1) + ℓ(h2, z2) + ⋯ + ℓ(hT, zT) accumulate over the rounds]
During each round t of the game:
The learner plays a hypothesis ht ∈ H
Simultaneously, the environment plays an instance zt ∈ Z
Then zt is revealed to the learner, which incurs the loss ℓ(ht, zt)
The goal for the learner is to minimize its cumulative loss over the course of the game.
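To make the protocol concrete, here is a minimal sketch of the learner/environment loop in Python; the Learner and Environment interfaces and the loss function are hypothetical placeholders, not part of the talk.

```python
# Minimal sketch of the online learning protocol (hypothetical interfaces).
def online_learning_game(learner, environment, loss, T):
    """Run T rounds of the repeated game; return the learner's cumulative loss."""
    total_loss = 0.0
    for t in range(T):
        h_t = learner.play()           # learner commits to a hypothesis h_t
        z_t = environment.play()       # environment picks an instance z_t simultaneously
        total_loss += loss(h_t, z_t)   # z_t is revealed and the loss is incurred
        learner.update(z_t)            # learner adapts before the next round
    return total_loss
```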
SLIDE 16
Online Learning
[Figure: a separating hyperplane ht and a labeled example zt]
Example: Online Linear Classification. On each round t:
The learner plays a separating hyperplane ht = sign(⟨wt, ·⟩)
Simultaneously, the environment plays a labeled example zt = (xt, yt)
Then, the learner incurs the hinge loss ℓ(ht, zt) = max(0, 1 − yt⟨xt, wt⟩)
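As a quick illustration, a minimal sketch of the hinge loss of a linear classifier; the concrete vectors below are illustrative only.

```python
import numpy as np

def hinge_loss(w, x, y):
    """Hinge loss of the linear hypothesis w on the labeled example (x, y), y in {-1, +1}."""
    return max(0.0, 1.0 - y * np.dot(w, x))

w = np.array([0.5, -1.0, 2.0])   # current hypothesis w_t
x = np.array([1.0, 0.0, 1.0])    # instance x_t
y = +1                           # label y_t
print(hinge_loss(w, x, y))       # 0.0 when y * <w, x> >= 1, positive otherwise
```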
SLIDE 19
Online Learning
[Figure: a probability distribution over the instances z1, …, z8]
Example: Online Density Estimation. On each round t:
The learner plays a probability distribution ht over Z
Simultaneously, the environment plays an instance zt
Then, the learner incurs the log loss ℓ(ht, zt) = − ln ht(zt)
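A tiny sketch of the log loss for a discrete density estimate; the probability values below are illustrative only.

```python
import numpy as np

# A discrete density estimate h_t over instances z1..z8 (illustrative values, summing to 1).
h = {f"z{i}": p for i, p in zip(range(1, 9),
                                [0.05, 0.10, 0.20, 0.25, 0.15, 0.10, 0.10, 0.05])}
z_t = "z4"                 # instance revealed by the environment
loss = -np.log(h[z_t])     # log loss -ln h_t(z_t)
print(loss)                # ~ 1.386
```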
SLIDE 22
Online Learning
[Diagram: the same repeated game, with the environment's moves drawn from distributions D1, D2, D3, …, DT]
Online learning can be applied to a wide range of tasks, ranging from statistical learning, where the environment is an oblivious player modelled by a fixed probability distribution D over Z, to adversarial learning, where the environment is an active player who changes its distribution at each iteration in response to the learner’s moves.
SLIDE 23
Online Learning
In a nutshell, online learning is particularly suited to:
Adaptive environments, where the data distribution can change over time
Streaming applications, where the data is not all available in advance
Large-scale datasets, by processing only one instance at a time
SLIDE 26
Online Learning
The performance of an online learning algorithm A is measured according to two metrics:
Minimax Regret. Defined as the maximum, over every sequence z1:T = (z1, …, zT) ∈ Z^T, of the cumulative loss of A relative to the best fixed hypothesis in H:

max_{z1:T ∈ Z^T} ( Σ_{t=1}^{T} ℓ(ht, zt) − min_{h∈H} Σ_{t=1}^{T} ℓ(h, zt) )

A is Hannan consistent if its minimax regret is sublinear in T.
Per-round Complexity. Given by the number of computational operations spent by A at each trial t for choosing a hypothesis ht in H and evaluating its loss ℓ(ht, zt).
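As an illustration, here is a small sketch that measures the empirical regret of a sequence of plays against the best fixed hypothesis in hindsight; the finite hypothesis set and the random loss table are hypothetical.

```python
import numpy as np

# Empirical regret of played hypotheses versus the best fixed hypothesis in hindsight.
# losses[i, t] = loss of hypothesis i on instance z_t (illustrative random data).
rng = np.random.default_rng(0)
n_hypotheses, T = 5, 100
losses = rng.uniform(0.0, 1.0, size=(n_hypotheses, T))
played = rng.integers(0, n_hypotheses, size=T)   # index of h_t chosen at each round

cumulative_loss = losses[played, np.arange(T)].sum()
best_in_hindsight = losses.sum(axis=1).min()     # min over h of sum_t loss(h, z_t)
print("regret:", cumulative_loss - best_in_hindsight)
```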
SLIDE 27
Outline
1 Online Learning
2 The Convex Case
3 The Combinatorial Case
4 Compiling Hedge
SLIDE 28
The Convex Case
[Figure: the hypothesis class H = {w ∈ ℝ^n : ‖w‖ ≤ b} (a ball of radius b), and the hinge loss ℓ(ht, zt) = max(0, 1 − yt⟨wt, xt⟩) plotted in blue against yt⟨wt, xt⟩]
Online Convex Learning. An online learning problem (H, Z, ℓ) is convex if:
H is a closed convex subset of ℝ^d
ℓ is convex in its first argument, i.e. ℓ(·, z) is convex for all z ∈ Z
SLIDE 30
The Convex Case
Online Gradient Descent
Start with the vector w1 = 0. During each round t:
Play wt
Observe zt and incur the loss ℓ(wt, zt)
Update the hypothesis as follows:

wt+1 = argmin_{w∈H} ‖w − w′t‖², where w′t = wt − ηt ∇ℓ(wt, zt)

The regret of OGD with respect to any w* ∈ H is bounded by

‖w*‖² / ηT + (1/2) Σ_{t=1}^{T} ηt ‖∇ℓ(wt, zt)‖²

Thus, if ℓ is L-Lipschitz and H is D-bounded, then using ηt = D/(L√t), OGD is Hannan consistent with regret 2DL√T.
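A minimal sketch of OGD with Euclidean projection onto an ℓ2-ball of radius b (my choice of H for illustration); the gradient oracle grad_loss is assumed given, not part of the talk.

```python
import numpy as np

def project_onto_ball(v, b):
    """Euclidean projection of v onto H = {w : ||w|| <= b}."""
    norm = np.linalg.norm(v)
    return v if norm <= b else (b / norm) * v

def ogd(grad_loss, instances, dim, b, D, L):
    """Online Gradient Descent; grad_loss(w, z) is the (sub)gradient of l(., z) at w."""
    w = np.zeros(dim)                          # w_1 = 0
    for t, z in enumerate(instances, start=1):
        yield w                                # play w_t, then observe z_t
        eta = D / (L * np.sqrt(t))             # step size eta_t = D / (L sqrt(t))
        w = project_onto_ball(w - eta * grad_loss(w, z), b)   # gradient step + projection
```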
SLIDE 31
Outline
1 Online Learning
2 The Convex Case
3 The Combinatorial Case
4 Compiling Hedge
SLIDE 33
The Combinatorial Case
In various situations, the hypothesis class is not a convex set, but involves combinatorial aspects which make the learning problem much more difficult to solve.
Example: Learning Boolean Functions
H is a class of Boolean functions (monomials, clauses, k-DNF, k-term DNF, etc.)
ℓ is the zero-one loss function (which counts the number of misclassifications)
SLIDE 34
The Combinatorial Case
[Figure: the lattice of monotone monomials over x1, …, x4, highlighting h* = argmin_{h∈H} Σ_{i=1}^{t} ℓ(h, zi)]
Let H be the class of monotone monomials over {0, 1}^d. Then, finding a hypothesis h ∈ H that minimizes the cumulative zero-one loss over an arbitrary set (z1, …, zt) of labeled instances in {0, 1}^d × {0, 1} is NP-hard (Kearns et al., 1994).
SLIDE 36
The Combinatorial Case
Hedge
Start from the uniform distribution p1 over H. On each round t:
Play ht at random according to pt
Observe zt and incur the loss ℓ(ht, zt)
Update the distribution as follows:

pt+1(h) = (1/Zt) pt(h) e^{−ηt ℓ(h, zt)}, where Zt = Σ_{h′∈H} pt(h′) e^{−ηt ℓ(h′, zt)}

The regret of Hedge with respect to any h* ∈ H is bounded by

ln|H| / ηT + (1/2) Σ_{t=1}^{T} ηt ‖ℓ(·, zt)‖∞²

Thus, if ℓ is B-bounded, then using ηt = (1/B)√(ln|H| / t), Hedge is Hannan consistent with regret 2B√(T ln|H|)
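A minimal sketch of Hedge over an explicitly enumerated hypothesis set; loss_fn is an assumed oracle returning losses in [0, B].

```python
import numpy as np

def hedge(loss_fn, hypotheses, instances, B):
    """Hedge (exponential weights) over a finite hypothesis list; loss_fn(h, z) in [0, B]."""
    n = len(hypotheses)
    p = np.full(n, 1.0 / n)                        # uniform p_1
    rng = np.random.default_rng(0)
    for t, z in enumerate(instances, start=1):
        h_t = hypotheses[rng.choice(n, p=p)]       # play h_t ~ p_t
        yield h_t, loss_fn(h_t, z)                 # observe z_t, incur loss
        eta = np.sqrt(np.log(n) / t) / B           # eta_t = (1/B) sqrt(ln|H| / t)
        losses = np.array([loss_fn(h, z) for h in hypotheses])
        p *= np.exp(-eta * losses)                 # multiplicative update
        p /= p.sum()                               # normalize by Z_t
```

Note that the per-round cost is linear in |H|, which is exactly the obstacle the compilation approach below addresses.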
SLIDE 37
The Combinatorial Case
Example: Learning Probabilistic Graphical Models
H is a class of probability distributions, each represented by a graphical structure G and a set θ of parameters
ℓ is the log loss (the negative log-likelihood of the model)
SLIDE 38
The Combinatorial Case
[Figure: a Markov network with structure G and parameters Θ achieving argmin_{h∈H} Σ_{i=1}^{t} ℓ(h, zi)]
Let G be the class of triangulated graphs over {1, …, d} and Θ be the class of marginal distributions over 3-cliques. Then, finding a Markov network (G, θ) that minimizes the cumulative log loss over an arbitrary set (z1, …, zt) of instances in {0, 1}^d is NP-hard (Srebro, 2003).
SLIDE 40
The Combinatorial Case
Hedge + OGD
Start from the uniform distribution p1 over G, and the parameter vector θ1 = 0. On each round t:
Play ht = (Gt, θt), where Gt ∼ pt
Observe zt and incur the loss ℓ(ht, zt)
Update the distribution pt with the Hedge rule
Update the parameters θt with the OGD rule

If ℓ is L-Lipschitz with respect to Θ and B-bounded with respect to G, then the Hedge + OGD strategy is Hannan consistent with regret in O(dBL√(T ln|G|))
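A minimal sketch combining the two update rules; structures is a finite list of candidate graphs, and the loss and gradient oracles (loss_fn, grad_theta) are assumed given, not part of the talk.

```python
import numpy as np

def hedge_plus_ogd(structures, loss_fn, grad_theta, instances, dim, B, D, L):
    """Hedge over graph structures, OGD over parameters (hypothetical oracles)."""
    n = len(structures)
    p = np.full(n, 1.0 / n)                        # uniform p_1 over G
    theta = np.zeros(dim)                          # theta_1 = 0
    rng = np.random.default_rng(0)
    for t, z in enumerate(instances, start=1):
        G_t = structures[rng.choice(n, p=p)]       # G_t ~ p_t
        yield (G_t, theta), loss_fn(G_t, theta, z)
        eta_h = np.sqrt(np.log(n) / t) / B         # Hedge step size
        p *= np.exp(-eta_h * np.array([loss_fn(G, theta, z) for G in structures]))
        p /= p.sum()                               # Hedge rule
        eta_g = D / (L * np.sqrt(t))               # OGD step size
        theta = theta - eta_g * grad_theta(G_t, theta, z)   # OGD rule (projection omitted)
```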
SLIDE 41
Outline
1 Online Learning
2 The Convex Case
3 The Combinatorial Case
4 Compiling Hedge
SLIDE 42
Compiling Hedge
[Figure: the lattice of monotone monomials over x1, …, x4]
Hedge maintains a high-dimensional probability distribution. Even for the simple class H of monotone monomials, we have to store |H| = 2^d weights.
Yet, if we consider H as a set of feasible solutions, each probability distribution pt can be represented in a declarative and compact way, using weighted constraints.
SLIDE 44
Compiling Hedge
[Figure: every monomial in the lattice carries weight 1]
C0 = (⊤ : 1)
Suppose we start from the uniform distribution p1 over the 2^d monomials. Here p1 can be trivially represented by the universal constraint C0 over the Boolean variables {x1, …, xn}.
SLIDE 45
Compiling Hedge
[Figure: every monomial containing x1 or x2 now carries weight 1/e^{η1}; the remaining monomials keep weight 1]
C0 = (⊤ : 1)
C1 = (x1 ∨ x2 : 1/e^{η1})
After receiving the labeled instance z1 = ((0, 0, 1, 1), 1), the distribution p2 can be represented by the weighted Boolean constraints {C0, C1}.
SLIDE 46
Compiling Hedge
[Figure: monomials containing neither x3 nor x4 (namely ∅, x1, x2, x1x2) pick up an extra factor 1/e^{η2}; e.g., x1 now carries weight 1/e^{η1+η2}]
C0 = (⊤ : 1)
C1 = (x1 ∨ x2 : 1/e^{η1})
C2 = (¬x3 ∧ ¬x4 : 1/e^{η2})
And, after receiving z2 = ((1, 1, 0, 0), 0), the distribution p3 can be described by the weighted Boolean constraints {C0, C1, C2}.
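To illustrate the encoding, here is a small sketch (my own, not from the talk) that evaluates the Hedge weight of a monomial as the product of the weights of the constraints satisfied by its characteristic vector; the learning rates are illustrative.

```python
import math
import itertools

eta1, eta2 = 0.5, 0.5   # illustrative learning rates

# Each constraint is (predicate over the monomial's membership vector, weight).
constraints = [
    (lambda m: True,                  1.0),               # C0: T : 1
    (lambda m: m[0] or m[1],          math.exp(-eta1)),   # C1: x1 v x2 : 1/e^eta1
    (lambda m: not m[2] and not m[3], math.exp(-eta2)),   # C2: -x3 ^ -x4 : 1/e^eta2
]

def weight(m):
    """Unnormalized Hedge weight of monomial m (m[i] = True iff x_{i+1} is in m)."""
    w = 1.0
    for pred, c in constraints:
        if pred(m):
            w *= c
    return w

# Enumerate all 2^4 monotone monomials and normalize to recover p_3.
monomials = list(itertools.product([False, True], repeat=4))
Z = sum(weight(m) for m in monomials)
p3 = {m: weight(m) / Z for m in monomials}
```

The point of the compilation question below is to query such a network (e.g., to sample from pt) without this exhaustive enumeration.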
SLIDE 47
Compiling Hedge
[Figure: a triangulated graph G encoded as the incidence vector (0, 0, 1, 0, 0, …, 0, 1, 1, 0, 1)]
For the class of triangulated graphical models, G can be viewed as a subset of {0, 1}^n, where n = (d choose 3).
SLIDE 48
Compiling Hedge
[Figure: the uniform distribution assigns weight 1 to every triangulated graph]
C0 = (g is a triangulated graph : 1)
The initial distribution p1 can be encoded by a constraint (set) C0 restricting feasible solutions to triangulated graphs.
SLIDE 49
Compiling Hedge
[Figure: the candidate structures (g1, θ1), …, (g36, θ1), filtered by C0]
C0 = (g is triangulated)
Ct = {(g1 = 1 : θt1), …, (gn = 1 : θtn)}
For the log loss, each incoming instance zt gives rise to a set of n weighted unary constraints Ct.
SLIDE 50
Compiling Hedge
Open Question 1: Which combinatorial hypothesis classes H enjoy the following property? The probability distribution pt over H maintained by Hedge (or its variants) can be encoded into a weighted constraint network N of size polynomial in d.