SLIDE 1
Online Learning
Fei Xia, Language Technology Institute
feixia@cs.cmu.edu
March 16, 2015

Outline
- Introduction
  - Why online learning
  - Basics of online learning
- Prediction with expert advice
  - Halving algorithm
  - Weighted majority
SLIDE 2
SLIDE 3
Why online learning?
- In many cases, data arrives sequentially while predictions are required on the fly
- Online algorithms do not require any distributional assumption
- Applicable in adversarial environments
- Simple algorithms
- Theoretical guarantees
SLIDE 4
Introduction
Basic properties:
- Instead of learning from a training set and then testing on a test set, the online learning scenario mixes the training and test phases.
- Instead of assuming that data points are sampled i.i.d. from a fixed distribution for both training and testing, online learning makes no distributional assumption.
- Instead of learning a hypothesis with small generalization error, an online learning algorithm is measured by its mistakes and its regret.
SLIDE 5
Introduction
Basic setting:
For t = 1, 2, ..., T:
- Receive an instance x_t ∈ X
- Make a prediction ŷ_t ∈ Y
- Receive the true label y_t ∈ Y
- Suffer loss L(ŷ_t, y_t)

Objective: minimize the cumulative loss Σ_{t=1}^T L(ŷ_t, y_t)
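The protocol above can be sketched in a few lines of code. This is a minimal illustration, assuming zero-one loss and a trivial constant predictor; all names are illustrative, not from the slides.

```python
def zero_one_loss(y_hat, y):
    """L(y_hat, y): 1 on a mistake, 0 otherwise (an illustrative loss choice)."""
    return 0 if y_hat == y else 1

def run_online(instances, labels, predict):
    """One pass of the online protocol: predict, observe the label, suffer loss."""
    total_loss = 0
    for x_t, y_t in zip(instances, labels):
        y_hat = predict(x_t)                     # make a prediction
        total_loss += zero_one_loss(y_hat, y_t)  # receive y_t, suffer loss
    return total_loss

# A constant predictor on a toy binary sequence: wrong only on the first label.
loss = run_online([1, 2, 3, 4], [0, 1, 1, 1], predict=lambda x: 1)
print(loss)  # 1
```

Note that training and testing are interleaved: every round is simultaneously a test (the loss is suffered) and a training signal (the label is revealed).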
SLIDE 6
Prediction with Expert Advice
For t = 1, 2, ..., T:
- Receive an instance x_t ∈ X
- Receive advice y_{t,i} ∈ Y, i ∈ [1, N], from N experts
- Make a prediction ŷ_t ∈ Y
- Receive the true label y_t ∈ Y
- Suffer loss L(ŷ_t, y_t)

Figure: Weather forecast, an example of a prediction problem based on expert advice [Mohri et al., 2012]
SLIDE 7
Regret Analysis
Objective: minimize the regret R_T

R_T = Σ_{t=1}^T L(ŷ_t, y_t) − min_{i∈[1,N]} Σ_{t=1}^T L(y_{t,i}, y_t)

What does low regret mean?
- We don't lose much from not knowing future events
- We can perform almost as well as someone who observes the entire sequence and picks the best prediction strategy in hindsight
- We can compete with a changing environment
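A tiny numeric illustration of the regret definition. The 0/1 losses below are invented for illustration only.

```python
# Invented 0/1 losses over T = 5 rounds with N = 2 experts.
learner_losses = [1, 0, 1, 0, 1]    # learner's per-round losses, total 3
expert_losses = [
    [1, 1, 0, 0, 1],                # expert 1: total loss 3
    [0, 0, 1, 0, 0],                # expert 2: total loss 1 (best in hindsight)
]

cumulative = sum(learner_losses)
best_expert = min(sum(row) for row in expert_losses)
regret = cumulative - best_expert
print(regret)  # 3 - 1 = 2
```

Low regret here would mean the learner's total loss stays close to 1, the loss of the best expert chosen in hindsight.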
SLIDE 8
Halving algorithm
Realizable case: after some number of rounds T, we learn the concept and no longer make errors.
Mistake bound: how many mistakes do we make before we learn a particular concept?
Maximum number of mistakes a learning algorithm A makes on a concept c (over all sequences S):

M_A(c) = max_S |mistakes(A, c)|

Maximum number of mistakes a learning algorithm A makes on a concept class C:

M_A(C) = max_{c∈C} M_A(c)
SLIDE 9
Halving algorithm
Algorithm 1 HALVING(H)
1: H_1 ← H
2: for t ← 1 to T do
3:   RECEIVE(x_t)
4:   ŷ_t ← MAJORITYVOTE(H_t, x_t)
5:   RECEIVE(y_t)
6:   if ŷ_t ≠ y_t then
7:     H_{t+1} ← {c ∈ H_t : c(x_t) = y_t}
8: return H_{T+1}
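Algorithm 1 can be sketched directly in code. The threshold hypothesis class below is an illustrative choice, not from the slides.

```python
import math

def halving(hypotheses, stream):
    """HALVING: predict by majority vote; on a mistake, drop inconsistent hypotheses."""
    active = list(hypotheses)
    mistakes = 0
    for x_t, y_t in stream:
        votes = sum(h(x_t) for h in active)
        y_hat = 1 if 2 * votes >= len(active) else 0  # majority vote
        if y_hat != y_t:
            mistakes += 1
            # keep only hypotheses consistent with the revealed label
            active = [h for h in active if h(x_t) == y_t]
    return active, mistakes

# Realizable toy case: thresholds h_k(x) = 1[x >= k] for k = 0..4; target is h_0.
H = [lambda x, k=k: int(x >= k) for k in range(5)]
data = [(0, 1), (1, 1), (3, 1)]
survivors, m = halving(H, data)
print(m)                        # 1 mistake
print(m <= math.log2(len(H)))   # True: within the log2 |H| mistake bound
```

On the single mistake (x=0, y=1), only h_0 is consistent, so the active set collapses immediately to the target.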
SLIDE 10
Halving algorithm
Theorem. Let H be a finite hypothesis set. Then M_Halving(H) ≤ log₂ |H|.

Proof. The algorithm makes predictions using a majority vote over the active set. Thus, at each mistake, the active set is reduced by at least half. Hence, after log₂ |H| mistakes, at most one active hypothesis can remain. Since we are in the realizable case, this hypothesis must coincide with the target concept, and we make no further mistakes.
SLIDE 11
Weighted majority algorithm
Algorithm 2 WEIGHTED-MAJORITY(N)
1: for i ← 1 to N do
2:   w_{1,i} ← 1
3: for t ← 1 to T do
4:   RECEIVE(x_t)
5:   if Σ_{i: y_{t,i}=1} w_{t,i} ≥ Σ_{i: y_{t,i}=0} w_{t,i} then
6:     ŷ_t ← 1
7:   else
8:     ŷ_t ← 0
9:   RECEIVE(y_t)
10:  if ŷ_t ≠ y_t then
11:    for i ← 1 to N do
12:      if y_{t,i} ≠ y_t then
13:        w_{t+1,i} ← β·w_{t,i}
14:      else w_{t+1,i} ← w_{t,i}
15: return w_{T+1}
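Algorithm 2 can be sketched as follows. This is a minimal implementation of the on-mistake update above; the three-expert example data is invented for illustration.

```python
def weighted_majority(expert_advice, labels, beta=0.5):
    """WM sketch: expert_advice[t][i] in {0,1}; returns (final weights, mistakes)."""
    n = len(expert_advice[0])
    w = [1.0] * n
    mistakes = 0
    for advice, y_t in zip(expert_advice, labels):
        w1 = sum(w[i] for i in range(n) if advice[i] == 1)
        w0 = sum(w[i] for i in range(n) if advice[i] == 0)
        y_hat = 1 if w1 >= w0 else 0
        if y_hat != y_t:
            mistakes += 1
            # on a mistake, every expert that erred this round is penalized by beta
            w = [w[i] * beta if advice[i] != y_t else w[i] for i in range(n)]
    return w, mistakes

# Three experts; expert 2 (index 2) is always right.
advice = [[1, 1, 0], [1, 0, 0], [0, 1, 1]]
labels = [0, 0, 1]
w, m = weighted_majority(advice, labels)
print(m)  # 1: only the first round is mispredicted
print(w)  # [0.5, 0.5, 1.0]
```

After the single mistake, the two wrong experts are halved and the always-correct expert dominates the vote from then on.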
SLIDE 12
Weighted majority algorithm
Theorem. Fix β ∈ (0, 1). Let m_T be the number of mistakes made by algorithm WM after T ≥ 1 rounds, and let m*_T be the number of mistakes made by the best of the N experts. Then the following inequality holds:

m_T ≤ (log N + m*_T log(1/β)) / log(2/(1+β))

Proof. Introduce a potential function W_t = Σ_{i=1}^N w_{t,i}, then derive its upper and lower bounds. Since the predictions are made by weighted majority vote, if the algorithm makes an error at round t, the experts holding at least half of the total weight were wrong and have their weight multiplied by β, so

W_{t+1} ≤ ((1+β)/2) · W_t
SLIDE 13
Weighted majority algorithm
Proof (cont.) After T rounds with m_T mistakes, and since W_1 = N,

W_T ≤ ((1+β)/2)^{m_T} · N

Note that we also have W_T ≥ w_{T,i} = β^{m_{T,i}}, where m_{T,i} is the number of mistakes made by the i-th expert. Thus,

β^{m*_T} ≤ ((1+β)/2)^{m_T} · N
⇒ m_T ≤ (log N + m*_T log(1/β)) / log(2/(1+β))
SLIDE 14
Weighted majority algorithm
m_T ≤ (log N + m*_T log(1/β)) / log(2/(1+β))

- m_T ≤ O(log N) + constant × (mistakes of the best expert)
- No assumption about the sequence of samples
- The number of mistakes is roughly a constant times that of the best expert in hindsight
- When m*_T = 0, the bound reduces to m_T ≤ O(log N), the same as for the Halving algorithm
SLIDE 15
Randomized weighted majority algorithm
Drawback of the weighted majority algorithm: with zero-one loss, no deterministic algorithm can achieve regret R_T = o(T).
In the randomized scenario, a set A = {1, ..., N} of N actions is available:
- At each round t ∈ [1, T], the online algorithm selects a distribution p_t over the set of actions
- It then receives a loss vector l_t, where l_{t,i} ∈ {0, 1} is the loss associated with action i
- Define the expected loss at round t: L_t = Σ_{i=1}^N p_{t,i} l_{t,i}, and the total expected loss over T rounds: L_T = Σ_{t=1}^T L_t
- Define the total loss of action i: L_{T,i} = Σ_{t=1}^T l_{t,i}, and the minimal loss of a single action: L^min_T = min_{i∈A} L_{T,i}
SLIDE 16
Randomized weighted majority algorithm
Algorithm 3 RANDOMIZED-WEIGHTED-MAJORITY(N)
1: for i ← 1 to N do
2:   w_{1,i} ← 1
3:   p_{1,i} ← 1/N
4: for t ← 1 to T do
5:   for i ← 1 to N do
6:     if l_{t,i} = 1 then
7:       w_{t+1,i} ← β·w_{t,i}
8:     else w_{t+1,i} ← w_{t,i}
9:   W_{t+1} ← Σ_{i=1}^N w_{t+1,i}
10:  for i ← 1 to N do
11:    p_{t+1,i} ← w_{t+1,i}/W_{t+1}
12: return w_{T+1}

Note: For binary prediction, let w0 be the total weight on outcome 0 and w1 the total weight on outcome 1, with W = w0 + w1; the prediction strategy is then to predict outcome i with probability wi/W.
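Algorithm 3 can be sketched as below, tracking the expected loss L_T. The two-action loss sequence is invented for illustration, and the bound check uses the theorem that follows.

```python
import math
import random

def rwm(loss_vectors, beta=0.5, seed=0):
    """RWM sketch: l_{t,i} in {0,1}; returns the total expected loss L_T."""
    rng = random.Random(seed)
    n = len(loss_vectors[0])
    w = [1.0] * n
    expected_loss = 0.0
    for l_t in loss_vectors:
        W = sum(w)
        p = [wi / W for wi in w]                       # p_t is proportional to w_t
        _action = rng.choices(range(n), weights=p)[0]  # the action actually played
        expected_loss += sum(p[i] * l_t[i] for i in range(n))
        w = [w[i] * beta if l_t[i] == 1 else w[i] for i in range(n)]
    return expected_loss

# Two actions; action 1 suffers loss only once, so L_min_T = 1.
losses = [[1, 0], [1, 0], [0, 1], [1, 0]]
L_T = rwm(losses)
bound = math.log(2) / (1 - 0.5) + (2 - 0.5) * 1  # log N/(1-beta) + (2-beta)·L_min_T
print(L_T <= bound)  # True
```

The expected loss here is about 1.97 against a bound of about 2.89, consistent with the guarantee.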
SLIDE 17
Randomized weighted majority algorithm
Theorem. Fix β ∈ [1/2, 1). Then for any T ≥ 1, the loss of algorithm RWM on any sequence can be bounded as follows:

L_T ≤ log N/(1−β) + (2−β)·L^min_T

In particular, for β = max{1/2, 1 − √((log N)/T)}, the loss can be bounded as:

L_T ≤ L^min_T + 2√(T log N)

Proof. Define the potential function W_t = Σ_{i=1}^N w_{t,i}, t ∈ [1, T].
SLIDE 18
Proof (cont.)

W_{t+1} = Σ_{i: l_{t,i}=0} w_{t,i} + β Σ_{i: l_{t,i}=1} w_{t,i}
        = W_t + (β − 1) W_t Σ_{i: l_{t,i}=1} p_{t,i}
        = W_t (1 − (1−β) L_t)
⇒ W_{T+1} = N Π_{t=1}^T (1 − (1−β) L_t)

Note that we also have W_{T+1} ≥ max_{i∈[1,N]} w_{T+1,i} = β^{L^min_T}. Thus,

β^{L^min_T} ≤ N Π_{t=1}^T (1 − (1−β) L_t)
⇒ L^min_T log β ≤ log N − (1−β) L_T   (using log(1−x) ≤ −x)
⇒ L_T ≤ log N/(1−β) + (2−β) L^min_T   (using −log(1−x) ≤ x + x² for x ∈ [0, 1/2], with x = 1−β)

Since L^min_T ≤ T, this also implies

L_T ≤ log N/(1−β) + (1−β)T + L^min_T

Minimizing the RHS with respect to β, we get

L_T ≤ L^min_T + 2√(T log N)  ⇔  R_T ≤ 2√(T log N)
SLIDE 19
Exponential weighted average algorithm
The WM algorithm extends to loss functions L taking values in [0, 1]. The EWA algorithm here is a further extension to any loss L that is convex in its first argument.

Algorithm 4 EXPONENTIAL-WEIGHTED-AVERAGE(N)
1: for i ← 1 to N do
2:   w_{1,i} ← 1
3: for t ← 1 to T do
4:   RECEIVE(x_t)
5:   ŷ_t ← (Σ_{i=1}^N w_{t,i} y_{t,i}) / (Σ_{i=1}^N w_{t,i})
6:   RECEIVE(y_t)
7:   for i ← 1 to N do
8:     w_{t+1,i} ← w_{t,i} e^{−η L(y_{t,i}, y_t)}
9: return w_{T+1}
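Algorithm 4 can be sketched as below. The squared loss is my choice of a loss that is convex in its first argument and stays in [0, 1] for predictions and labels in [0, 1]; the two-expert example is invented for illustration.

```python
import math

def sq_loss(y_hat, y):
    """Squared loss: convex in its first argument; in [0,1] for y_hat, y in [0,1]."""
    return (y_hat - y) ** 2

def ewa(expert_predictions, labels, eta=1.0):
    """EWA sketch: predict the weighted average of the experts' predictions."""
    n = len(expert_predictions[0])
    w = [1.0] * n
    total_loss = 0.0
    for preds, y_t in zip(expert_predictions, labels):
        W = sum(w)
        y_hat = sum(w[i] * preds[i] for i in range(n)) / W
        total_loss += sq_loss(y_hat, y_t)
        # exponential update based on each expert's own loss
        w = [w[i] * math.exp(-eta * sq_loss(preds[i], y_t)) for i in range(n)]
    return total_loss

# Expert 1 always predicts the true label 1.0; expert 0 never does.
preds = [[0.0, 1.0]] * 3
labels = [1.0, 1.0, 1.0]
L = ewa(preds, labels)
print(L < math.log(2) / 1.0 + 3 / 8)  # True: best expert has loss 0, so
                                      # the regret bound log N/eta + eta*T/8 applies
```

The weight of the bad expert decays as e^{−t}, so the learner's prediction converges quickly to the perfect expert's.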
SLIDE 20
Exponential weighted average algorithm
Theorem. Assume that the loss function L is convex in its first argument and takes values in [0, 1]. Then, for any η > 0 and any sequence y_1, ..., y_T ∈ Y, the regret of the EWA algorithm is bounded as:

R_T ≤ log N/η + ηT/8

In particular, for η = √(8 log N / T), the regret is bounded as:

R_T ≤ √((T/2) log N)

Proof. Define the potential function Φ_t = log Σ_{i=1}^N w_{t,i}, t ∈ [1, T].
SLIDE 21
Exponential weighted average algorithm
Proof. Using Hoeffding's inequality and the convexity of L in its first argument, we can prove that

Φ_{t+1} − Φ_t ≤ −η L(ŷ_t, y_t) + η²/8
⇒ Φ_{T+1} − Φ_1 ≤ −η Σ_{t=1}^T L(ŷ_t, y_t) + η²T/8

Next, we lower-bound Φ_{T+1} − Φ_1:

Φ_{T+1} − Φ_1 = log Σ_{i=1}^N e^{−η L_{T,i}} − log N
             ≥ log max_{i∈[1,N]} e^{−η L_{T,i}} − log N
             = −η min_{i∈[1,N]} L_{T,i} − log N

Combining the lower and upper bounds, we get

Σ_{t=1}^T L(ŷ_t, y_t) − min_{i∈[1,N]} L_{T,i} ≤ log N/η + ηT/8
SLIDE 22
Exponential weighted average algorithm
The optimal choice of η requires knowledge of T, which is a disadvantage of this analysis. How can this be solved? The doubling trick: divide time into periods [2^k, 2^{k+1} − 1] of length 2^k, k = 0, ..., n, and choose η_k = √(8 log N / 2^k) within period k. This leads to the following theorem.

Theorem. Assume that the loss function L is convex in its first argument and takes values in [0, 1]. Then, for any T ≥ 1 and any sequence y_1, ..., y_T ∈ Y, the regret of the EWA algorithm after T rounds is bounded as follows:

R_T ≤ (√2/(√2 − 1)) √((T/2) log N) + √((log N)/2)
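The η schedule of the doubling trick can be sketched as follows; the function names are illustrative.

```python
import math

def doubling_etas(T, n_experts):
    """eta for rounds t = 1..T: round t lies in period k = floor(log2 t),
    and eta_k = sqrt(8 log N / 2^k) is tuned to the period length 2^k."""
    etas = []
    for t in range(1, T + 1):
        k = t.bit_length() - 1  # floor(log2 t), exact for positive integers
        etas.append(math.sqrt(8 * math.log(n_experts) / 2 ** k))
    return etas

etas = doubling_etas(8, n_experts=4)
# periods: round 1 | rounds 2-3 | rounds 4-7 | round 8  (k = 0, 1, 2, 3)
print(etas[1] == etas[2] and etas[3] == etas[6])  # True: eta is constant per period
```

Restarting EWA at each period boundary with η_k pays only the constant factor √2/(√2 − 1) ≈ 3.41 over the bound obtained with known T.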
SLIDE 23
Summary
- In many cases, data arrives sequentially while predictions are required on the fly
- Online algorithms do not require any distributional assumption
- Applicable in adversarial environments
- Simple algorithms
- Theoretical guarantees
SLIDE 24