SLIDE 1

Online Learning II

Presenter: Adams Wei Yu

Carnegie Mellon University

Mar 2015

SLIDE 2

Recap of Online Learning

The data comes sequentially, and we do not need to assume anything about the data distribution: this is the adversarial setting (worst-case analysis). Regret minimization:

R_T = Σ_{t=1}^T L(ŷ_t, y_t) − min_{i∈{1,...,N}} Σ_{t=1}^T L(ŷ_{t,i}, y_t)

Several simple algorithms come with theoretical guarantees (Halving, Weighted Majority, Randomized Weighted Majority, Exponential Weighted Average).

SLIDE 3

Weighted Majority Algorithm

Algorithm 1 WEIGHTED-MAJORITY(N)

1: for i ← 1 to N do
2:   w_{1,i} ← 1
3: for t ← 1 to T do
4:   RECEIVE(x_t)
5:   if Σ_{i: y_{t,i}=1} w_{t,i} ≥ Σ_{i: y_{t,i}=0} w_{t,i} then
6:     ŷ_t ← 1
7:   else
8:     ŷ_t ← 0
9:   RECEIVE(y_t)
10:  if ŷ_t ≠ y_t then
11:    for i ← 1 to N do
12:      if y_{t,i} ≠ y_t then
13:        w_{t+1,i} ← β w_{t,i}
14:      else w_{t+1,i} ← w_{t,i}
15: return w_{T+1}
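
A minimal Python sketch of Algorithm 1, assuming the experts' predictions are given as a 0/1 matrix (the names expert_preds and labels are illustrative, not from the slides):

import numpy as np

def weighted_majority(expert_preds, labels, beta=0.5):
    """Weighted Majority: down-weight experts that err on mistake rounds.
    expert_preds: (T, N) array of 0/1 expert predictions.
    labels: (T,) array of 0/1 true labels."""
    T, N = expert_preds.shape
    w = np.ones(N)                              # w_{1,i} <- 1
    mistakes = 0
    for t in range(T):
        vote_1 = w[expert_preds[t] == 1].sum()  # weighted vote for label 1
        vote_0 = w[expert_preds[t] == 0].sum()  # weighted vote for label 0
        y_hat = 1 if vote_1 >= vote_0 else 0
        if y_hat != labels[t]:                  # update only on a mistake
            mistakes += 1
            wrong = expert_preds[t] != labels[t]
            w[wrong] *= beta                    # w_{t+1,i} <- beta w_{t,i}
    return w, mistakes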

SLIDE 4

Randomized Weighted Majority Algorithm

Algorithm 2 RANDOMIZED-WEIGHTED-MAJORITY(N)

1: for i ← 1 to N do
2:   w_{1,i} ← 1; p_{1,i} ← 1/N
3: for t ← 1 to T do
4:   RECEIVE(x_t)
5:   p_1 ← Σ_{i: y_{t,i}=1} p_{t,i};  p_0 ← Σ_{i: y_{t,i}=0} p_{t,i}
6:   Draw u ∼ Uniform(0,1)
7:   if u < p_1 then
8:     ŷ_t ← 1
9:   else
10:    ŷ_t ← 0
11:  for i ← 1 to N do
12:    if l_{t,i} = 1 then        ⊲ expert i errs, i.e. y_{t,i} ≠ y_t
13:      w_{t+1,i} ← β w_{t,i}
14:    else w_{t+1,i} ← w_{t,i}
15:  W_{t+1} ← Σ_{i=1}^N w_{t+1,i}
16:  for i ← 1 to N do
17:    p_{t+1,i} ← w_{t+1,i} / W_{t+1}
18: return w_{T+1}
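
A corresponding sketch of Algorithm 2; the prediction is now randomized according to the normalized weights, and the update runs on every round (again, variable names are illustrative):

import numpy as np

def randomized_weighted_majority(expert_preds, labels, beta=0.5, seed=0):
    """Randomized Weighted Majority: predict 1 with probability p_1,
    then apply the same multiplicative update."""
    rng = np.random.default_rng(seed)
    T, N = expert_preds.shape
    w = np.ones(N)
    p = w / N                                   # p_{1,i} <- 1/N
    for t in range(T):
        p1 = p[expert_preds[t] == 1].sum()      # mass on experts voting 1
        y_hat = 1 if rng.uniform() < p1 else 0
        err = expert_preds[t] != labels[t]      # l_{t,i} = 1 iff expert i errs
        w[err] *= beta
        p = w / w.sum()                         # renormalize: p_{t+1} = w / W_{t+1}
    return w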

SLIDE 5

Topics today

Perceptron algorithm and its mistake bound. Winnow algorithm and its mistake bound. Conversion from online to batch algorithms, with analysis.

SLIDE 6

Perceptron Algorithm

Algorithm 3 PERCEPTRON(w_0)

1: for t ← 1 to T do
2:   RECEIVE(x_t)
3:   ŷ_t ← sgn(w_t · x_t)
4:   RECEIVE(y_t)
5:   if ŷ_t ≠ y_t then
6:     w_{t+1} ← w_t + y_t x_t        ⊲ more generally η y_t x_t
7:   else w_{t+1} ← w_t
8: return w_{T+1}

If x_t is misclassified, then y_t (w_t · x_t) is negative. After one update, y_t (w_{t+1} · x_t) = y_t (w_t · x_t) + η ‖x_t‖₂², so the term y_t (w_t · x_t) is corrected by η ‖x_t‖₂².
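
A short Python sketch of Algorithm 3, assuming labels in {−1, +1} (a minimal illustration, not the full setup of the slides):

import numpy as np

def perceptron(X, y, eta=1.0):
    """Perceptron: additive update w <- w + eta y_t x_t on each mistake.
    X: (T, N) array of points; y: (T,) labels in {-1, +1}."""
    T, N = X.shape
    w = np.zeros(N)                      # start from w = 0
    updates = 0
    for t in range(T):
        if np.sign(w @ X[t]) != y[t]:    # mistake (sign(0) counts as one)
            w += eta * y[t] * X[t]       # raises y_t (w . x_t) by eta ||x_t||^2
            updates += 1
    return w, updates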

SLIDE 7

Another Point of View: Stochastic Gradient Descent

The Perceptron algorithm can be seen as finding the minimizer of an objective function F:

F(w) = (1/T) Σ_{t=1}^T max(0, −y_t (w · x_t)) = E_{x∼D̂}[F(w, x)]

where F(w, x) = max(0, −f(x)(w · x)), with f(x) the label of x and D̂ the empirical distribution of the sample (x_1, ..., x_T). F(w) is convex in w.

SLIDE 8

Another Point of View: Stochastic Gradient Descent

w_{t+1} ← w_t − η ∇_w F(w_t, x_t)  if F(w, x_t) is differentiable at w_t;
w_{t+1} ← w_t                      otherwise.

Note that for F(w, x_t) = max(0, −y_t (w · x_t)),

∇_w F(w, x_t) = −y_t x_t  if y_t (w · x_t) < 0;
∇_w F(w, x_t) = 0         if y_t (w · x_t) > 0.

It follows that

w_{t+1} ← w_t + η y_t x_t  if y_t (w_t · x_t) < 0;
w_{t+1} ← w_t              if y_t (w_t · x_t) > 0;
w_{t+1} ← w_t              otherwise.
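
The case analysis above amounts to the following one-step function, assuming labels in {−1, +1} (at the non-differentiable point y_t (w · x_t) = 0 the weight is left unchanged, as on the slide):

import numpy as np

def sgd_step(w, x_t, y_t, eta=1.0):
    """One stochastic (sub)gradient step on F(w, x_t) = max(0, -y_t (w . x_t))."""
    if y_t * (w @ x_t) < 0:          # misclassified: gradient is -y_t x_t
        return w + eta * y_t * x_t
    return w                          # correct side or boundary: no change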

SLIDE 9

Upper Bound on the Number of Mistakes: Separable Case

Theorem 1

Let x_1, ..., x_T ∈ R^N be a sequence of T points with ‖x_t‖ ≤ r for all t ∈ [1, T], for some r > 0. Assume that there exist ρ > 0 and v ∈ R^N such that for all t ∈ [1, T], ρ ≤ y_t (v · x_t)/‖v‖. Then the number of updates made by the Perceptron algorithm when processing x_1, ..., x_T is bounded by r²/ρ².

SLIDE 10

Proof

I: the subset of the T rounds at which there is an update. M: the total number

  • f updates, i.e. |I| = M.

Mρ ≤v ·

t∈I ytxt

v (ρ ≤ yt(v · xt) v ) ≤

  • t∈I

ytxt (Cauchy-Schwarz inequality) =

  • t∈I

(wt+1 − wt) (definition of updates) =wT+1 (telescope sum, w0 = 0) =

  • t∈I

wt+12 − wt2 (telescope sum, w0 = 0) =

  • t∈I

wt + ytxt2 − wt2 (definition of updates) =

  • t∈I

2ytwt · xt + xt2 ≤

  • t∈I

xt2 ≤ √ Mr 2 ⇒ M ≤ r 2/ρ2

SLIDE 11

Remarks

The Perceptron algorithm is simple. The bound on the number of updates depends only on the margin ρ (we may assume r = 1) and is independent of the dimension N. This O(1/ρ²) bound is tight for the Perceptron algorithm. The algorithm may be very slow when ρ is small, and multiple passes over the data may be needed. It loops forever if the data is not separable.

SLIDE 12

Upper Bound on the Number of Mistakes: Inseparable Case

Theorem 2

Let x_1, ..., x_T ∈ R^N be a sequence of T points with ‖x_t‖ ≤ r for all t ∈ [1, T], for some r > 0. Let ρ > 0 and v ∈ R^N with ‖v‖ = 1. Define the deviation of x_t by d_t = max{0, ρ − y_t (v · x_t)} and let δ = (Σ_{t=1}^T d_t²)^{1/2}. Then the number of updates made by the Perceptron algorithm when processing x_1, ..., x_T is bounded by (r + δ)²/ρ².

Key idea: construct data points in a higher-dimensional space that are separable and have the same prediction behavior as the original points.

SLIDE 13

Proof

We first reduce the problem to the separable case by mapping each point x_t ∈ R^N to a higher-dimensional vector x′_t ∈ R^{N+T}:

x_t = (x_{t,1}, ..., x_{t,N})ᵀ → x′_t = (x_{t,1}, ..., x_{t,N}, 0, ..., Δ, ..., 0)ᵀ

where the (N+t)-th component of x′_t equals Δ and the remaining new components are 0. The vector v is mapped to

v = (v_1, ..., v_N)ᵀ → v′ = (v_1/Z, ..., v_N/Z, y_1 d_1/(ΔZ), ..., y_T d_T/(ΔZ))ᵀ

To make ‖v′‖ = 1, we set Z = √(1 + δ²/Δ²).

Then the predictions made by the Perceptron for x′_t, t ∈ [1, T], coincide with those made in the original space for x_t.

SLIDE 14

Proof (con’t)

y_t (v′ · x′_t) = y_t (v · x_t)/Z + Δ y_t² d_t/(ΔZ) = y_t (v · x_t)/Z + d_t/Z ≥ y_t (v · x_t)/Z + (ρ − y_t (v · x_t))/Z = ρ/Z

So x′_1, ..., x′_T is linearly separable with margin ρ/Z. Noting that ‖x′_t‖² ≤ r² + Δ² and applying Theorem 1, the number of updates made by the Perceptron algorithm is bounded by

(r² + Δ²)(1 + δ²/Δ²)/ρ²

Choosing Δ² to minimize this bound leads to Δ² = rδ, and the bound becomes (r + δ)²/ρ².

SLIDE 15

Dual Perceptron

For the original Perceptron, we can write the separating hyperplane as

w = Σ_{s=1}^T α_s y_s x_s

where α_s is incremented by one each time the prediction on x_s does not match the correct label. The algorithm can then be written as:

Algorithm 4 DUAL-PERCEPTRON(α_0)

1: α ← α_0            ⊲ typically α_0 = 0
2: for t ← 1 to T do
3:   RECEIVE(x_t)
4:   ŷ_t ← sgn(Σ_{s=1}^T α_s y_s (x_s · x_t))
5:   RECEIVE(y_t)
6:   if ŷ_t ≠ y_t then
7:     α_t ← α_t + 1
8:   else α_t ← α_t
9: return α
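
A sketch of Algorithm 4 in Python; it also shows how the primal weight vector is recovered from the dual coefficients (names are illustrative):

import numpy as np

def dual_perceptron(X, y):
    """Dual Perceptron: keep one counter alpha_s per training point.
    X: (T, N) array; y: (T,) labels in {-1, +1}."""
    T = len(X)
    alpha = np.zeros(T)
    G = X @ X.T                              # Gram matrix of inner products
    for t in range(T):
        if np.sign((alpha * y) @ G[:, t]) != y[t]:
            alpha[t] += 1                    # increment on each mistake
    return alpha

# The primal hyperplane is recovered as w = sum_s alpha_s y_s x_s:
# w = (alpha * y) @ X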

SLIDE 16

Kernel Perceptron

Algorithm 5 KERNEL-PERCEPTRON(α_0)

1: α ← α_0            ⊲ typically α_0 = 0
2: for t ← 1 to T do
3:   RECEIVE(x_t)
4:   ŷ_t ← sgn(Σ_{s=1}^T α_s y_s K(x_s, x_t))
5:   RECEIVE(y_t)
6:   if ŷ_t ≠ y_t then
7:     α_t ← α_t + 1
8:   else α_t ← α_t
9: return α

Any positive definite symmetric (PDS) kernel can be used.
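
Replacing the inner product with a kernel gives a direct sketch of Algorithm 5; the Gaussian (RBF) kernel below is one standard PDS choice (the gamma parameter is illustrative):

import numpy as np

def kernel_perceptron(X, y, kernel):
    """Kernel Perceptron: Algorithm 4 with x_s . x_t replaced by K(x_s, x_t)."""
    T = len(X)
    alpha = np.zeros(T)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    for t in range(T):
        if np.sign((alpha * y) @ K[:, t]) != y[t]:
            alpha[t] += 1
    return alpha

def rbf(a, b, gamma=1.0):
    """Gaussian kernel, a PDS kernel: K(a, b) = exp(-gamma ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))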

SLIDE 17

Winnow Algorithm

Algorithm 6 WINNOW(η)

1: w_1 ← 1/N
2: for t ← 1 to T do
3:   RECEIVE(x_t)
4:   ŷ_t ← sgn(w_t · x_t)
5:   RECEIVE(y_t)
6:   if ŷ_t ≠ y_t then
7:     Z_t ← Σ_{i=1}^N w_{t,i} exp(η y_t x_{t,i})
8:     for i ← 1 to N do
9:       w_{t+1,i} ← w_{t,i} exp(η y_t x_{t,i}) / Z_t
10:  else w_{t+1} ← w_t
11: return w_{T+1}
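
A Python sketch of Algorithm 6; the update is multiplicative and the weights stay normalized (a minimal illustration with labels in {−1, +1}):

import numpy as np

def winnow(X, y, eta):
    """Winnow: multiplicative update w_i <- w_i exp(eta y_t x_{t,i}) / Z_t
    on each mistake. X: (T, N); y: (T,) in {-1, +1}."""
    T, N = X.shape
    w = np.ones(N) / N                    # w_1 = 1/N
    for t in range(T):
        if np.sign(w @ X[t]) != y[t]:
            w *= np.exp(eta * y[t] * X[t])
            w /= w.sum()                  # divide by Z_t
    return w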

SLIDE 18

Upper Bound on the Number of Mistakes: Separable Case

Theorem 3

Let x_1, ..., x_T ∈ R^N be a sequence of T points with ‖x_t‖_∞ ≤ r_∞ for all t ∈ [1, T], for some r_∞ > 0. Assume that there exist ρ_∞ > 0 and v ∈ R^N such that for all t ∈ [1, T], ρ_∞ ≤ y_t (v · x_t)/‖v‖₁. Then, for η = ρ_∞/r_∞², the number of updates made by the Winnow algorithm when processing x_1, ..., x_T is upper bounded by 2 (r_∞²/ρ_∞²) log N.

SLIDE 19

Proof

I: the subset of the T rounds at which there is an update. M: the total number of updates, i.e. |I| = M. The potential function Φ_t is the relative entropy between the distribution defined by the normalized weights v_i/‖v‖₁, i ∈ [1, N], and the one defined by the components of the weight vector w_t:

Φ_t = Σ_{i=1}^N (v_i/‖v‖₁) log( (v_i/‖v‖₁) / w_{t,i} )

SLIDE 20

Proof(con’t)

For any round t ∈ I,

Φ_{t+1} − Φ_t = Σ_{i=1}^N (v_i/‖v‖₁) log(w_{t,i}/w_{t+1,i})
            = Σ_{i=1}^N (v_i/‖v‖₁) log( Z_t / exp(η y_t x_{t,i}) )
            = log Z_t − η Σ_{i=1}^N (v_i/‖v‖₁) y_t x_{t,i}
            ≤ log( Σ_{i=1}^N w_{t,i} exp(η y_t x_{t,i}) ) − η ρ_∞
            = log E_{w_t}[exp(η y_t x_t)] − η ρ_∞
            ≤ log exp(η² (2 r_∞)²/8) − η ρ_∞    (Hoeffding's lemma, using E_{w_t}[y_t x_t] = y_t (w_t · x_t) ≤ 0 on an update round)
            = η² r_∞²/2 − η ρ_∞

SLIDE 21

Proof(con’t)

Summing this inequality over the M update rounds gives

Φ_{T+1} − Φ_1 ≤ M (η² r_∞²/2 − η ρ_∞)

We also have an upper bound on Φ_1:

Φ_1 = Σ_{i=1}^N (v_i/‖v‖₁) log( (v_i/‖v‖₁) / (1/N) ) = log N + Σ_{i=1}^N (v_i/‖v‖₁) log(v_i/‖v‖₁) ≤ log N

and, since the relative entropy Φ_{T+1} is nonnegative,

Φ_{T+1} − Φ_1 ≥ 0 − log N = − log N

So − log N ≤ M (η² r_∞²/2 − η ρ_∞). Setting η = ρ_∞/r_∞² yields the statement of the theorem.

SLIDE 22

Remarks

For both Perceptron and Winnow, a norm ‖·‖_p is used for the input vectors x_t and the dual norm ‖·‖_q for the separating hyperplane v (1/p + 1/q = 1). Perceptron: p = q = 2; Winnow: p = ∞, q = 1. The Winnow bound is favorable when a sparse set of experts can predict well, and the Perceptron bound is more favorable in the opposite situation. For example, if v = e_1 = (1, 0, ..., 0) ∈ R^N and x_t ∈ {−1, 1}^N, then the upper bound for Winnow is of order log N while that for Perceptron is N.
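
To make the example concrete, here is the arithmetic for a large N (using r = √N, ρ = 1 for the Perceptron and r_∞ = ρ_∞ = 1 for Winnow, which is what v = e_1 and x_t ∈ {−1, 1}^N give):

import math

N = 10**6                          # number of experts / dimension
perceptron_bound = N               # r^2 / rho^2 = N, since r = sqrt(N), rho = 1
winnow_bound = 2 * math.log(N)     # 2 (r_inf^2 / rho_inf^2) log N = 2 log N
print(perceptron_bound, winnow_bound)   # 1000000 vs. about 27.6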

SLIDE 23

Online to Batch Conversion

What if we assume the data is drawn from some unknown distribution? Can these online algorithms be used to derive hypotheses with small generalization error in the standard stochastic setting?

SLIDE 24

Basic Setting

Let S = ((x_1, y_1), ..., (x_T, y_T)) be drawn i.i.d. from some fixed but unknown distribution D. The sample is sequentially processed by an online learning algorithm A: it starts with an initial hypothesis h_1 ∈ H and generates a new hypothesis h_{i+1} ∈ H after processing the pair (x_i, y_i), i ∈ [1, T]. The regret is defined as

R_T = Σ_{i=1}^T L(h_i(x_i), y_i) − min_{h∈H} Σ_{i=1}^T L(h(x_i), y_i)

The generalization error of h ∈ H is its expected loss

R(h) = E_{(x,y)∼D}[L(h(x), y)]

SLIDE 25

Theoretical results

Lemma 4

Let S = ((x_1, y_1), ..., (x_T, y_T)) be drawn i.i.d. from some fixed but unknown distribution D, let L be a loss bounded by M, and let h_1, ..., h_T be the sequence of hypotheses generated by an online algorithm sequentially processing S. Then, for any δ > 0, with probability at least 1 − δ, the following holds:

(1/T) Σ_{i=1}^T R(h_i) ≤ (1/T) Σ_{i=1}^T L(h_i(x_i), y_i) + M √(2 log(1/δ)/T)

SLIDE 26

Theoretical results

Theorem 5

Let S = ((x_1, y_1), ..., (x_T, y_T)) be drawn i.i.d. from some fixed but unknown distribution D, let L be a loss bounded by M and convex with respect to its first argument, and let h_1, ..., h_{T+1} be the sequence of hypotheses generated by an online algorithm sequentially processing S. Then, for any δ > 0, with probability at least 1 − δ, each of the following holds:

R( (1/T) Σ_{i=1}^T h_i ) ≤ (1/T) Σ_{i=1}^T L(h_i(x_i), y_i) + M √(2 log(1/δ)/T)

R( (1/T) Σ_{i=1}^T h_i ) ≤ inf_{h∈H} R(h) + R_T/T + 2M √(2 log(2/δ)/T)

SLIDE 27

Proof

By the convexity of L with respect to its first argument, for any (x, y),

L( (1/T) Σ_{i=1}^T h_i(x), y ) ≤ (1/T) Σ_{i=1}^T L(h_i(x), y)

Taking expectations over (x, y) ∼ D, we get

R( (1/T) Σ_{i=1}^T h_i ) ≤ (1/T) Σ_{i=1}^T R(h_i)

The first inequality then follows immediately from the previous lemma.

SLIDE 28

Proof(con’t)

By the definition of the regret R_T, for any δ > 0, the following holds with probability at least 1 − δ/2:

R( (1/T) Σ_{i=1}^T h_i ) ≤ (1/T) Σ_{i=1}^T L(h_i(x_i), y_i) + M √(2 log(2/δ)/T)
                        ≤ min_{h∈H} (1/T) Σ_{i=1}^T L(h(x_i), y_i) + R_T/T + M √(2 log(2/δ)/T)

By the definition of inf_{h∈H} R(h), for any ε > 0 there exists h* ∈ H with R(h*) ≤ inf_{h∈H} R(h) + ε. By Hoeffding's inequality, for any δ > 0, with probability at least 1 − δ/2,

(1/T) Σ_{i=1}^T L(h*(x_i), y_i) ≤ R(h*) + M √(2 log(2/δ)/T)

SLIDE 29

Proof(con’t)

Thus, for any ε > 0, by the union bound, the following holds with probability at least 1 − δ:

R( (1/T) Σ_{i=1}^T h_i ) ≤ (1/T) Σ_{i=1}^T L(h*(x_i), y_i) + R_T/T + M √(2 log(2/δ)/T)
                        ≤ R(h*) + M √(2 log(2/δ)/T) + R_T/T + M √(2 log(2/δ)/T)
                        = R(h*) + R_T/T + 2M √(2 log(2/δ)/T)
                        ≤ inf_{h∈H} R(h) + ε + R_T/T + 2M √(2 log(2/δ)/T)

Since ε > 0 is arbitrary, we have

R( (1/T) Σ_{i=1}^T h_i ) ≤ inf_{h∈H} R(h) + R_T/T + 2M √(2 log(2/δ)/T)

SLIDE 30

Application to Exponential Weighted Average Algorithm

Assume the loss function is bounded by M = 1. Recall that the regret bound of the Exponential Weighted Average algorithm is

R_T ≤ √((T/2) log N)

Substituting into the previous theorem, we get

R( (1/T) Σ_{i=1}^T h_i ) ≤ inf_{h∈H} R(h) + √(log N/(2T)) + 2 √(2 log(2/δ)/T)
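
As a concrete illustration of the conversion, the sketch below averages the iterates of an online linear learner (here the Perceptron, so h_i(x) = w_i · x and averaging hypotheses reduces to averaging weight vectors); this is an assumption-laden toy, not part of the slides:

import numpy as np

def online_to_batch(X, y, eta=1.0):
    """Run an online learner over S and return the averaged hypothesis
    (1/T) sum_{i=1}^T h_i, whose risk Theorem 5 bounds."""
    T, N = X.shape
    w = np.zeros(N)
    w_sum = np.zeros(N)
    for t in range(T):
        w_sum += w                        # h_t is the hypothesis before round t
        if np.sign(w @ X[t]) != y[t]:     # Perceptron update on a mistake
            w += eta * y[t] * X[t]
    return w_sum / T                      # averaged hypothesis for batch use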

SLIDE 31

Summary

Perceptron algorithm and its mistake bound. Winnow algorithm and its mistake bound. Conversion from online to batch algorithms.
