SLIDE 1

Online Learning

Fei Xia
Language Technology Institute
feixia@cs.cmu.edu
March 16, 2015

SLIDE 2

Outline

Introduction
- Why online learning
- Basics of online learning

Prediction with expert advice
- Halving algorithm
- Weighted majority algorithm
- Randomized weighted majority algorithm
- Exponential weighted average algorithm

SLIDE 3

Why online learning?

- In many cases, data arrives sequentially while predictions are required on the fly
- Online algorithms do not require any distributional assumption
- Applicable in adversarial environments
- Simple algorithms
- Theoretical guarantees

SLIDE 4

Introduction

Basic Properties:
- Instead of learning from a training set and then testing on a test set, the online learning scenario mixes the training and test phases.
- Instead of assuming that data points are drawn i.i.d. from a fixed distribution for both training and testing, online learning makes no distributional assumption at all.
- Instead of learning a hypothesis with small generalization error, online learning algorithms are measured using a mistake model and the regret.

SLIDE 5

Introduction

Basic Setting: for t = 1, 2, ..., T:
- Receive an instance $x_t \in X$
- Make a prediction $\hat{y}_t \in Y$
- Receive the true label $y_t \in Y$
- Suffer loss $L(\hat{y}_t, y_t)$

Objective: $\min \sum_{t=1}^{T} L(\hat{y}_t, y_t)$
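To make the protocol concrete, here is a minimal Python sketch of this loop; the `predict`, `update`, `stream`, and `loss` arguments are illustrative placeholders, not anything defined on the slides:

    # Minimal sketch of the online learning protocol (all names illustrative).
    def run_online(predict, update, stream, loss):
        """Run the online protocol; stream yields (x_t, y_t) pairs.
        Returns the cumulative loss sum_t L(yhat_t, y_t)."""
        total = 0.0
        for x_t, y_t in stream:
            y_hat = predict(x_t)        # make a prediction yhat_t
            total += loss(y_hat, y_t)   # suffer loss L(yhat_t, y_t)
            update(x_t, y_t)            # learner adapts before the next round
        return total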

SLIDE 6

Prediction with Expert Advice

For t = 1, 2, ..., T:
- Receive an instance $x_t \in X$
- Receive advice $y_{t,i} \in Y$, $i \in [1, N]$, from N experts
- Make a prediction $\hat{y}_t \in Y$
- Receive the true label $y_t \in Y$
- Suffer loss $L(\hat{y}_t, y_t)$

Figure: Weather forecast, an example of a prediction problem based on expert advice [Mohri et al., 2012]

SLIDE 7

Regret Analysis

Objective: minimize the regret $R_T$:

$$R_T = \sum_{t=1}^{T} L(\hat{y}_t, y_t) - \min_{i=1}^{N} \sum_{t=1}^{T} L(\hat{y}_{t,i}, y_t)$$

What does low regret mean?
- It means that we don't lose much from not knowing future events
- It means that we can perform almost as well as someone who observes the entire sequence and picks the best prediction strategy in hindsight
- It means that we can compete with a changing environment
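To make the formula concrete, the following sketch computes $R_T$ from a table of losses; the numbers are made up purely for illustration:

    # Regret: R_T = sum_t L(yhat_t, y_t) - min_i sum_t L(yhat_{t,i}, y_t).
    # Toy losses for T = 5 rounds and N = 3 experts (illustrative numbers).
    algo_losses = [1, 0, 1, 0, 1]      # algorithm's loss per round
    expert_losses = [
        [1, 1, 0, 0, 1],               # expert 1
        [0, 0, 1, 0, 0],               # expert 2: best in hindsight
        [1, 0, 1, 1, 1],               # expert 3
    ]
    regret = sum(algo_losses) - min(sum(row) for row in expert_losses)
    print(regret)                      # 3 - 1 = 2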

SLIDE 8

Halving algorithm

- Realizable case: after some number of rounds T, we will have learned the concept and no longer make errors.
- Mistake bound: how many mistakes do we make before we learn a particular concept?
- Maximum number of mistakes a learning algorithm A makes for a concept c (the maximum is over all sequences S of examples):
$$M_A(c) = \max_{S} |\text{mistakes}(A, c)|$$
- Maximum number of mistakes a learning algorithm A makes for a concept class C:
$$M_A(C) = \max_{c \in C} M_A(c)$$

SLIDE 9

Halving algorithm

Algorithm 1 HALVING(H)

    H_1 ← H
    for t ← 1 to T do
        RECEIVE(x_t)
        ŷ_t ← MAJORITYVOTE(H_t, x_t)
        RECEIVE(y_t)
        if ŷ_t ≠ y_t then
            H_{t+1} ← {c ∈ H_t : c(x_t) = y_t}
    return H_{T+1}
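A direct Python rendering of HALVING, assuming hypotheses are given as callables mapping an instance to a label (a sketch, not code from the slides):

    def halving(hypotheses, stream):
        """HALVING over a finite hypothesis set H (realizable case).
        hypotheses: list of callables c(x) -> label; stream yields (x_t, y_t)."""
        active = list(hypotheses)                     # H_1 <- H
        for x_t, y_t in stream:
            votes = [c(x_t) for c in active]
            y_hat = max(set(votes), key=votes.count)  # MAJORITYVOTE(H_t, x_t)
            if y_hat != y_t:                          # mistake: prune the active set
                active = [c for c in active if c(x_t) == y_t]
        return active                                 # H_{T+1}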

SLIDE 10

Halving algorithm

Theorem. Let H be a finite hypothesis set. Then
$$M_{\text{Halving}}(H) \le \log_2 |H|$$

Proof. The algorithm makes predictions using a majority vote over the active set, so at each mistake the active set is reduced by at least half. Hence, after $\log_2 |H|$ mistakes, only one active hypothesis can remain. Since we are in the realizable case, this hypothesis must coincide with the target concept, and we make no further mistakes.

SLIDE 11

Weighted majority algorithm

Algorithm 2 WEIGHTED-MAJORITY(N)

    for i ← 1 to N do
        w_{1,i} ← 1
    for t ← 1 to T do
        RECEIVE(x_t)
        if Σ_{i: y_{t,i}=1} w_{t,i} ≥ Σ_{i: y_{t,i}=0} w_{t,i} then
            ŷ_t ← 1
        else
            ŷ_t ← 0
        RECEIVE(y_t)
        if ŷ_t ≠ y_t then
            for i ← 1 to N do
                if y_{t,i} ≠ y_t then
                    w_{t+1,i} ← β · w_{t,i}
                else
                    w_{t+1,i} ← w_{t,i}
    return w_{T+1}
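A Python sketch of the same algorithm for labels in {0, 1} (the input format is assumed for illustration):

    def weighted_majority(expert_advice, labels, beta=0.5):
        """expert_advice[t][i]: advice of expert i at round t, in {0, 1};
        labels[t]: true label y_t. Returns final weights and mistake count."""
        n = len(expert_advice[0])
        w = [1.0] * n                                       # w_{1,i} <- 1
        mistakes = 0
        for advice, y_t in zip(expert_advice, labels):
            weight_one = sum(wi for wi, a in zip(w, advice) if a == 1)
            weight_zero = sum(wi for wi, a in zip(w, advice) if a == 0)
            y_hat = 1 if weight_one >= weight_zero else 0   # weighted majority vote
            if y_hat != y_t:                                # on a mistake, down-weight
                mistakes += 1                               # every expert that erred
                w = [wi * beta if a != y_t else wi for wi, a in zip(w, advice)]
        return w, mistakes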

SLIDE 12

Weighted majority algorithm

Theorem. Fix β ∈ (0, 1). Let $m_T$ be the number of mistakes made by algorithm WM after T ≥ 1 rounds, and $m_T^*$ the number of mistakes made by the best of the N experts. Then the following inequality holds:
$$m_T \le \frac{\log N + m_T^* \log \frac{1}{\beta}}{\log \frac{2}{1+\beta}}$$

Proof. Introduce a potential function $W_t = \sum_{i=1}^{N} w_{t,i}$, then derive its upper and lower bounds. Since the predictions are generated by weighted majority vote, if the algorithm makes an error at round t, the experts that voted for the wrong label hold at least half of the total weight, so
$$W_{t+1} \le \left(\frac{1+\beta}{2}\right) W_t$$

SLIDE 13

Weighted majority algorithm

Proof (cont.) With $m_T$ mistakes after T rounds and $W_1 = N$, we get
$$W_{T+1} \le \left(\frac{1+\beta}{2}\right)^{m_T} N$$
Note that we also have $W_{T+1} \ge w_{T+1,i} = \beta^{m_{T,i}}$ for every i, where $m_{T,i}$ is the number of mistakes made by the i-th expert. Thus
$$\beta^{m_T^*} \le \left(\frac{1+\beta}{2}\right)^{m_T} N \;\Rightarrow\; m_T \le \frac{\log N + m_T^* \log \frac{1}{\beta}}{\log \frac{2}{1+\beta}}$$

SLIDE 14

Weighted majority algorithm

$$m_T \le \frac{\log N + m_T^* \log \frac{1}{\beta}}{\log \frac{2}{1+\beta}}$$

- $m_T \le O(\log N) + \text{constant} \times (\text{mistakes of the best expert})$
- No assumption is made about the sequence of samples
- The number of mistakes is roughly a constant times that of the best expert in hindsight
- When $m_T^* = 0$, the bound reduces to $m_T \le O(\log N)$, which is the same as the Halving algorithm
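To get a feel for the constants in the bound, one can evaluate it numerically (toy values chosen only for illustration):

    import math

    # m_T <= (log N + m*_T log(1/beta)) / log(2/(1+beta))
    N, m_star, beta = 100, 20, 0.5
    bound = (math.log(N) + m_star * math.log(1 / beta)) / math.log(2 / (1 + beta))
    print(round(bound, 1))  # ~64.2 mistakes: log N plus ~2.4x the best expert's 20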

SLIDE 15

Randomized weighted majority algorithm

- Drawback of the weighted majority algorithm: with zero-one loss, no deterministic algorithm can achieve a regret $R_T = o(T)$
- In the randomized scenario, a set $A = \{1, \ldots, N\}$ of N actions is available
- At each round $t \in [1, T]$, an online algorithm A selects a distribution $p_t$ over the set of actions
- It then receives a loss vector $l_t$, where $l_{t,i} \in \{0, 1\}$ is the loss associated with action i
- Define the expected loss for round t as $L_t = \sum_{i=1}^{N} p_{t,i} l_{t,i}$, and the total loss for T rounds as $\mathcal{L}_T = \sum_{t=1}^{T} L_t$
- Define the total loss associated with action i as $L_{T,i} = \sum_{t=1}^{T} l_{t,i}$, and the minimal loss of a single action as $L_T^{\min} = \min_{i \in A} L_{T,i}$

SLIDE 16

Randomized weighted majority algorithm

Algorithm 3 RANDOMIZED-WEIGHTED-MAJORITY(N)

    for i ← 1 to N do
        w_{1,i} ← 1
        p_{1,i} ← 1/N
    for t ← 1 to T do
        for i ← 1 to N do
            if l_{t,i} = 1 then
                w_{t+1,i} ← β · w_{t,i}
            else
                w_{t+1,i} ← w_{t,i}
        W_{t+1} ← Σ_{i=1}^{N} w_{t+1,i}
        for i ← 1 to N do
            p_{t+1,i} ← w_{t+1,i} / W_{t+1}
    return w_{T+1}

Note: For binary prediction, let $w^0$ be the total weight on outcome 0 and $w^1$ the total weight on outcome 1, with $W = w^0 + w^1$; the prediction strategy is then to predict outcome i with probability $w^i / W$.
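A Python sketch of RWM with binary losses; the sampling step makes the realized loss a random variable whose expectation at round t is $L_t$ (the input format is assumed for illustration):

    import random

    def randomized_weighted_majority(loss_vectors, beta=0.5, seed=0):
        """loss_vectors[t][i]: loss l_{t,i} in {0, 1} of action i at round t.
        Returns the total realized loss of the sampled actions."""
        rng = random.Random(seed)
        n = len(loss_vectors[0])
        w = [1.0] * n                                     # w_{1,i} <- 1 (p_1 uniform)
        total_loss = 0
        for losses in loss_vectors:
            W = sum(w)
            p = [wi / W for wi in w]                      # p_{t,i} <- w_{t,i} / W_t
            action = rng.choices(range(n), weights=p)[0]  # sample an action from p_t
            total_loss += losses[action]                  # suffer l_{t, action}
            w = [wi * beta if l == 1 else wi for wi, l in zip(w, losses)]
        return total_loss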

SLIDE 17

Randomized weighted majority algorithm

Theorem. Fix β ∈ [1/2, 1). Then for any T ≥ 1, the loss of algorithm RWM on any sequence can be bounded as follows:
$$\mathcal{L}_T \le \frac{\log N}{1 - \beta} + (2 - \beta) L_T^{\min}$$
In particular, for $\beta = \max\{1/2,\; 1 - \sqrt{(\log N)/T}\}$, the loss can be bounded as:
$$\mathcal{L}_T \le L_T^{\min} + 2\sqrt{T \log N}$$

Proof. Define the potential function $W_t = \sum_{i=1}^{N} w_{t,i}$, $t \in [1, T]$.

SLIDE 18

Proof (cont.)
$$W_{t+1} = \sum_{i: l_{t,i}=0} w_{t,i} + \beta \sum_{i: l_{t,i}=1} w_{t,i} = W_t + (\beta - 1) W_t \sum_{i: l_{t,i}=1} p_{t,i} = W_t \big(1 - (1-\beta) L_t\big)$$
$$\Rightarrow\; W_{T+1} = N \prod_{t=1}^{T} \big(1 - (1-\beta) L_t\big)$$
Note that we also have $W_{T+1} \ge \max_{i \in [1,N]} w_{T+1,i} = \beta^{L_T^{\min}}$, thus
$$\beta^{L_T^{\min}} \le N \prod_{t=1}^{T} \big(1 - (1-\beta) L_t\big)$$
Taking logarithms and using $\log(1-x) \le -x$ gives
$$L_T^{\min} \log \beta \le \log N - (1-\beta) \mathcal{L}_T$$
and, using $-\log \beta \le (1-\beta)(2-\beta)$ for $\beta \in [1/2, 1)$,
$$\mathcal{L}_T \le \frac{\log N}{1-\beta} + (2-\beta) L_T^{\min}$$
Since $L_T^{\min} \le T$, this also implies
$$\mathcal{L}_T \le \frac{\log N}{1-\beta} + (1-\beta)T + L_T^{\min}$$
Minimizing the right-hand side with respect to β yields
$$\mathcal{L}_T \le L_T^{\min} + 2\sqrt{T \log N} \;\Leftrightarrow\; R_T \le 2\sqrt{T \log N}$$
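For completeness, the minimization behind the final step can be spelled out (a standard calculus step that the slide leaves implicit):

$$\frac{d}{d\beta}\left[\frac{\log N}{1-\beta} + (1-\beta)T\right] = \frac{\log N}{(1-\beta)^2} - T = 0 \;\Longrightarrow\; 1-\beta = \sqrt{\frac{\log N}{T}}$$

Substituting back gives $\sqrt{T \log N} + \sqrt{T \log N} = 2\sqrt{T \log N}$; the constraint $\beta \ge 1/2$ is what produces the $\max$ in the theorem statement.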

SLIDE 19

Exponential weighted average algorithm

The WM algorithm extends to loss functions L taking values in [0, 1]. The EWA algorithm here is a further extension for losses that are convex in their first argument.

Algorithm 4 EXPONENTIAL-WEIGHTED-AVERAGE(N)

    for i ← 1 to N do
        w_{1,i} ← 1
    for t ← 1 to T do
        RECEIVE(x_t)
        ŷ_t ← (Σ_{i=1}^{N} w_{t,i} · ŷ_{t,i}) / (Σ_{i=1}^{N} w_{t,i})
        RECEIVE(y_t)
        for i ← 1 to N do
            w_{t+1,i} ← w_{t,i} · e^{−η L(ŷ_{t,i}, y_t)}
    return w_{T+1}
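A Python sketch of EWA; the loss function is passed in and assumed convex in its first argument with values in [0, 1], e.g. the squared loss on predictions in [0, 1]:

    import math

    def exponential_weighted_average(expert_preds, labels, eta, loss):
        """expert_preds[t][i]: prediction yhat_{t,i} of expert i at round t.
        Predicts the weighted average, then exponentially down-weights experts."""
        n = len(expert_preds[0])
        w = [1.0] * n                                   # w_{1,i} <- 1
        total_loss = 0.0
        for preds, y_t in zip(expert_preds, labels):
            W = sum(w)
            y_hat = sum(wi * p for wi, p in zip(w, preds)) / W  # weighted average
            total_loss += loss(y_hat, y_t)
            w = [wi * math.exp(-eta * loss(p, y_t)) for wi, p in zip(w, preds)]
        return total_loss

    # Example usage with squared loss:
    # sq = lambda a, b: (a - b) ** 2
    # exponential_weighted_average([[0.2, 0.9], [0.4, 0.8]], [1.0, 1.0], 0.5, sq)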

SLIDE 20

Exponential weighted average algorithm

Theorem. Assume that the loss function L is convex in its first argument and takes values in [0, 1]. Then, for any η > 0 and any sequence $y_1, \ldots, y_T \in Y$, the regret of the EWA algorithm is bounded as:
$$R_T \le \frac{\log N}{\eta} + \frac{\eta T}{8}$$
In particular, for $\eta = \sqrt{8 \log N / T}$, the regret is bounded as:
$$R_T \le \sqrt{(T/2) \log N}$$

Proof. Define the potential function $\Phi_t = \log \sum_{i=1}^{N} w_{t,i}$, $t \in [1, T]$.

SLIDE 21

Exponential weighted average algorithm

Proof. Using Hoeffding's lemma and the convexity of L in its first argument, one can show that
$$\Phi_{t+1} - \Phi_t \le -\eta L(\hat{y}_t, y_t) + \frac{\eta^2}{8}$$
$$\Rightarrow\; \Phi_{T+1} - \Phi_1 \le -\eta \sum_{t=1}^{T} L(\hat{y}_t, y_t) + \frac{\eta^2 T}{8}$$
We then lower-bound $\Phi_{T+1} - \Phi_1$:
$$\Phi_{T+1} - \Phi_1 = \log \sum_{i=1}^{N} e^{-\eta L_{T,i}} - \log N \ge \log \max_{i=1}^{N} e^{-\eta L_{T,i}} - \log N = -\eta \min_{i=1}^{N} L_{T,i} - \log N$$
Combining the lower and upper bounds, we get
$$\sum_{t=1}^{T} L(\hat{y}_t, y_t) - \min_{i=1}^{N} L_{T,i} \le \frac{\log N}{\eta} + \frac{\eta T}{8}$$
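The omitted per-round step can be reconstructed as follows, writing $p_{t,i} = w_{t,i} / \sum_{j} w_{t,j}$ (this is the standard argument, filled in here rather than quoted from the slides):

$$\Phi_{t+1} - \Phi_t = \log \frac{\sum_{i} w_{t,i}\, e^{-\eta L(\hat{y}_{t,i}, y_t)}}{\sum_{i} w_{t,i}} = \log \mathbb{E}_{i \sim p_t}\left[e^{-\eta L(\hat{y}_{t,i}, y_t)}\right] \le -\eta\, \mathbb{E}_{i \sim p_t}\left[L(\hat{y}_{t,i}, y_t)\right] + \frac{\eta^2}{8}$$

where the inequality is Hoeffding's lemma applied to the bounded random variable $L(\hat{y}_{t,i}, y_t) \in [0, 1]$. Since L is convex in its first argument, Jensen's inequality gives $\mathbb{E}_{i \sim p_t}[L(\hat{y}_{t,i}, y_t)] \ge L(\hat{y}_t, y_t)$, which yields the claimed per-round bound.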

SLIDE 22

Exponential weighted average algorithm

The optimal choice of η requires knowledge of T, which is a disadvantage of this analysis. How can this be addressed? With the doubling trick: divide time into periods $[2^k, 2^{k+1} - 1]$ of length $2^k$, $k = 0, \ldots, n$, and choose $\eta_k = \sqrt{\frac{8 \log N}{2^k}}$ within period k. This leads to the following theorem.

Theorem. Assume that the loss function L is convex in its first argument and takes values in [0, 1]. Then, for any T ≥ 1 and any sequence $y_1, \ldots, y_T \in Y$, the regret of the EWA algorithm after T rounds is bounded as follows:
$$R_T \le \frac{\sqrt{2}}{\sqrt{2} - 1} \sqrt{\frac{T}{2} \log N} + \sqrt{\frac{\log N}{2}}$$
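A small sketch of the doubling-trick schedule in Python; the period layout and $\eta_k$ follow the slide, while the function name is illustrative:

    import math

    def doubling_trick_schedule(T, N):
        """Periods I_k = [2^k, 2^{k+1} - 1] with eta_k = sqrt(8 log N / 2^k).
        Returns (start_round, length, eta_k) for every period needed to cover T."""
        schedule = []
        k = 0
        while 2 ** k <= T:
            length = 2 ** k
            schedule.append((2 ** k, length, math.sqrt(8 * math.log(N) / length)))
            k += 1
        return schedule

    # E.g. doubling_trick_schedule(10, 4) -> periods starting at rounds 1, 2, 4, 8;
    # EWA is restarted with the new eta_k at the start of each period.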

SLIDE 23

Summary

- In many cases, data arrives sequentially while predictions are required on the fly
- Online algorithms do not require any distributional assumption
- Applicable in adversarial environments
- Simple algorithms
- Theoretical guarantees

SLIDE 24

Reference I

Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2012). Foundations of Machine Learning. MIT Press.

Shalev-Shwartz, S. and Singer, Y. (2008). Tutorial on theory and applications of online learning. http://ttic.uchicago.edu/~shai/icml08tutorial/OLtutorial.pdf. [Online; accessed 15-Mar-2015].
