

SLIDE 1

CS 6316 Machine Learning

Boosting

Yangfeng Ji

Department of Computer Science, University of Virginia

SLIDE 2

Overview

SLIDE 3

The Bias-Variance Decomposition

The expected error is decomposed as

$$\mathbb{E}[\epsilon^2] \;+\; \underbrace{\mathbb{E}\big[\{h(x, S) - \mathbb{E}[h(x, S)]\}^2\big]}_{\text{variance}} \;+\; \underbrace{\{\mathbb{E}[h(x, S)] - f_{\mathcal{D}}(x)\}^2}_{\text{bias}^2}$$

◮ bias: how far the expected prediction E[h(x, S)] diverges from the optimal predictor f_D(x)

◮ variance: how much a hypothesis learned from a specific training set S diverges from the average prediction E[h(x, S)]
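To make the two terms concrete, below is a minimal simulation sketch (not from the slides): it repeatedly draws a training set S, fits a deliberately simple hypothesis, and estimates the variance and bias² terms at one test point. The true function, the noise level, and the linear hypothesis class are illustrative assumptions.

```python
# Illustrative Monte Carlo estimate of the variance and bias^2 terms at a test point x0.
import numpy as np

rng = np.random.default_rng(0)
f_D = lambda x: np.sin(2 * np.pi * x)    # assumed "optimal predictor" f_D
noise_std = 0.1                          # assumed noise level for epsilon
x0 = 0.3                                 # test point where h(x0, S) is evaluated

predictions = []
for _ in range(2000):                    # draw many training sets S
    x = rng.uniform(0.0, 1.0, size=20)
    y = f_D(x) + rng.normal(0.0, noise_std, size=20)
    coeffs = np.polyfit(x, y, deg=1)     # a simple (high-bias) hypothesis: a line
    predictions.append(np.polyval(coeffs, x0))

predictions = np.array(predictions)
variance = predictions.var()                      # E[{h(x0, S) - E[h(x0, S)]}^2]
bias_sq = (predictions.mean() - f_D(x0)) ** 2     # {E[h(x0, S)] - f_D(x0)}^2
print(f"variance ~ {variance:.4f}, bias^2 ~ {bias_sq:.4f}")
```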

SLIDE 4

Motivation

How can we reduce the overall error? E.g.,

◮ Reduce the bias
  ◮ Boosting: start with simple classifiers, and gradually combine them into a powerful one
◮ Reduce the variance
  ◮ Bagging: train classifiers on multiple bootstrap copies of the data, then combine them (a minimal sketch follows below)
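A minimal, illustrative sketch of the bagging idea above (not an algorithm from the slides), assuming binary labels in {−1, +1} and using scikit-learn's DecisionTreeClassifier purely as a convenient base learner:

```python
# Bagging sketch: train B classifiers on bootstrap resamples and combine by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, num_models=25, seed=0):
    rng = np.random.default_rng(seed)
    m = len(y)
    models = []
    for _ in range(num_models):
        idx = rng.integers(0, m, size=m)   # bootstrap sample: draw m indices with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.stack([model.predict(X) for model in models])   # shape (B, n)
    return np.sign(votes.sum(axis=0))      # majority vote over {-1, +1} labels (ties map to 0)
```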

SLIDE 5

The Idea of Boosting

SLIDE 6

Weak Learnability

SLIDE 7

Weak Learnability

◮ A learning algorithm A is a γ-weak-learner for a hypothesis space H if, under the same conditions as PAC learning, the algorithm returns a hypothesis h such that, with probability of at least 1 − δ,

$$L_{(\mathcal{D}, f)}(h) \le \frac{1}{2} - \gamma \tag{1}$$

◮ A hypothesis space H is γ-weak-learnable if there exists a γ-weak-learner for this class

SLIDE 8

Strong vs. Weak Learnability

◮ Strong learnability:

$$L_{(\mathcal{D}, f)}(h) \le \epsilon \tag{2}$$

where ε can be made arbitrarily small

◮ Weak learnability:

$$L_{(\mathcal{D}, f)}(h) \le \frac{1}{2} - \gamma \tag{3}$$

where γ > 0. In other words, a weak learner only needs to be slightly better than random guessing


SLIDE 10

Decision Stumps

◮ Let X = ℝ^d. The hypothesis space of decision stumps is defined as

$$\mathcal{H}_{\text{DS}} = \{b \cdot \mathrm{sign}(x_{\cdot,j} - \theta) : \theta \in \mathbb{R},\ j \in [d],\ b \in \{-1, +1\}\} \tag{4}$$

with parameters θ ∈ ℝ, j ∈ [d], and b ∈ {−1, +1}

◮ For each h_{θ,j,b} ∈ H_DS with j = 1 and b = +1,

$$h_{\theta,1,+1}(x) = \begin{cases} +1 & x_{\cdot,1} > \theta \\ -1 & x_{\cdot,1} < \theta \end{cases} \tag{5}$$

[Figure: a threshold θ on the x1 axis splitting points in the (x1, x2) plane]
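A minimal sketch of Eq. (4) as code (an illustration, not course code); the function name and interface are my own:

```python
# Decision stump h_{theta, j, b}(x) = b * sign(x_{., j} - theta), for labels in {-1, +1}.
import numpy as np

def stump_predict(X, theta, j, b):
    """X: array of shape (n, d); returns an array of {-1, +1} predictions."""
    X = np.atleast_2d(X)
    # np.sign(0) would give 0, so map points exactly at the threshold to -1 for definiteness
    raw = np.where(X[:, j] - theta > 0, 1, -1)
    return b * raw
```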


SLIDE 12

Empirical Risk

◮ The (weighted) empirical risk on a training set S = {(x_1, y_1), . . . , (x_m, y_m)}, with a distribution D = (D_1, . . . , D_m) over the examples, is defined as

$$L_{D}(h_{\theta,j,b}) = \sum_{i=1}^{m} D_i \cdot \mathbb{1}[h_{\theta,j,b}(x_i) \neq y_i] \tag{6}$$

where 1[·] is the indicator function: 1[h(x_i) ≠ y_i] = 1 when h(x_i) ≠ y_i is true, and 0 otherwise

◮ In the special case D_i = 1/m,

$$L_{D}(h) = L_S(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}[h(x_i) \neq y_i] \tag{7}$$
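A minimal sketch of Eq. (6) in code (illustrative; the names are my own). With uniform weights it reduces to the ordinary training error of Eq. (7).

```python
# Weighted empirical risk L_D(h) = sum_i D_i * 1[h(x_i) != y_i] (illustrative sketch).
import numpy as np

def weighted_error(predict, X, y, D):
    """predict: callable mapping an (n, d) array to {-1, +1} labels; D: weights summing to 1."""
    mistakes = (predict(X) != y).astype(float)   # the indicator 1[h(x_i) != y_i]
    return float(np.dot(D, mistakes))
```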


SLIDE 16

Learning a Decision Stump

◮ For each j ∈ [d]:

  ◮ Sort the training examples such that

$$x_{1,j} \le x_{2,j} \le \cdots \le x_{m,j} \tag{8}$$

  ◮ Define the candidate thresholds

$$\Theta_j = \Big\{\frac{x_{i,j} + x_{i+1,j}}{2} : i \in [m-1]\Big\} \cup \{x_{1,j} - 1,\ x_{m,j} + 1\}$$

  ◮ Try each θ′ ∈ Θ_j and find the minimal risk for this j:

$$L_{D}(h_{\theta',j,b}) = \sum_{i=1}^{m} D_i \cdot \mathbb{1}[h_{\theta',j,b}(x_i) \neq y_i] \tag{9}$$

◮ Return the stump with the minimal risk over all j ∈ [d] (a runnable sketch follows below)
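A minimal sketch of the search above (my own illustrative implementation, not the course's reference code): it tries every feature, every midpoint threshold, and both signs b, and returns the stump with the smallest D-weighted error.

```python
# Weighted ERM over decision stumps (illustrative sketch of the procedure above).
import numpy as np

def learn_decision_stump(X, y, D):
    """X: (m, d) array, y: (m,) labels in {-1, +1}, D: (m,) nonnegative weights summing to 1.
    Returns ((theta, j, b), error) minimizing sum_i D_i * 1[b * sign(x_{i,j} - theta) != y_i]."""
    m, d = X.shape
    best, best_err = None, np.inf
    for j in range(d):
        xs = np.sort(X[:, j])
        # candidate thresholds: midpoints between sorted values, plus one below/above all of them
        thetas = np.concatenate([(xs[:-1] + xs[1:]) / 2.0, [xs[0] - 1.0, xs[-1] + 1.0]])
        for theta in thetas:
            raw = np.where(X[:, j] - theta > 0, 1, -1)
            for b in (+1, -1):
                err = float(np.dot(D, (b * raw != y).astype(float)))
                if err < best_err:
                    best_err, best = err, (float(theta), j, b)
    return best, best_err
```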


SLIDE 20

Example

Build a decision stump for the following classification task under the assumption that

$$D = \Big(\frac{1}{9}, \ldots, \frac{1}{9}\Big) \tag{10}$$

[Figure: nine labeled points in the (x1, x2) plane]

The best decision stump splits on x_{·,1} at θ = 0.6

SLIDE 21

Boosting


SLIDE 24

Boosting

Q: Can we boost a set of weak classifiers into a strong classifier?

A: Yes. The combined classifier looks like

$$h_S(x) = \mathrm{sign}\Big(\sum_{t=1}^{T} w_t h_t(x)\Big) \tag{11}$$

Three questions:

◮ How do we find each weak classifier h_t(x)?
◮ How do we compute w_t?
◮ How large does T need to be?

(A small sketch of the weighted vote in Eq. (11) follows below.)
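A minimal illustration of Eq. (11), assuming each weak classifier is a decision stump stored as the (theta, j, b) triple used in the earlier sketches (names are my own):

```python
# Weighted-majority vote h_S(x) = sign(sum_t w_t * h_t(x)) over decision stumps (sketch).
import numpy as np

def boosted_predict(X, stumps, weights):
    """stumps: list of (theta, j, b) triples; weights: list of the corresponding w_t."""
    X = np.atleast_2d(X)
    scores = np.zeros(X.shape[0])
    for (theta, j, b), w in zip(stumps, weights):
        scores += w * b * np.where(X[:, j] - theta > 0, 1, -1)
    return np.sign(scores)
```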


SLIDE 29

AdaBoost

1: Input: $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, weak learner A, number of rounds T
2: Initialize $D^{(1)} = (\tfrac{1}{m}, \ldots, \tfrac{1}{m})$
3: for $t = 1, \ldots, T$ do
4:     Learn a weak classifier $h_t = A(D^{(t)}, S)$
5:     Compute the weighted error $\epsilon_t = \sum_{i=1}^{m} D_i^{(t)} \mathbb{1}[h_t(x_i) \neq y_i]$
6:     Let $w_t = \frac{1}{2} \log\big(\frac{1}{\epsilon_t} - 1\big)$
7:     Update, for all $i = 1, \ldots, m$:
       $$D_i^{(t+1)} = \frac{D_i^{(t)} \exp(-w_t\, y_i h_t(x_i))}{\sum_{j=1}^{m} D_j^{(t)} \exp(-w_t\, y_j h_t(x_j))}$$
8: end for
9: Output: the hypothesis $h_S(x) = \mathrm{sign}\big(\sum_{t=1}^{T} w_t h_t(x)\big)$

(A runnable sketch of these steps follows below.)
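For concreteness, here is a minimal AdaBoost sketch following the pseudocode above (my own illustrative code, not the course's reference implementation); it reuses the hypothetical learn_decision_stump helper defined after Slide 16.

```python
# AdaBoost with decision stumps (illustrative sketch of the pseudocode above).
import numpy as np

def adaboost(X, y, T, eps=1e-12):
    """X: (m, d) array, y: (m,) labels in {-1, +1}. Returns the stumps and their weights."""
    m = len(y)
    D = np.full(m, 1.0 / m)                       # line 2: D^(1) = (1/m, ..., 1/m)
    stumps, weights = [], []
    for t in range(T):                            # line 3
        (theta, j, b), _ = learn_decision_stump(X, y, D)       # line 4
        pred = b * np.where(X[:, j] - theta > 0, 1, -1)
        eps_t = float(np.dot(D, (pred != y).astype(float)))    # line 5
        eps_t = min(max(eps_t, eps), 1.0 - eps)   # guard against eps_t being exactly 0 or 1
        w_t = 0.5 * np.log(1.0 / eps_t - 1.0)     # line 6
        D = D * np.exp(-w_t * y * pred)           # line 7: reweight ...
        D = D / D.sum()                           #         ... and renormalize
        stumps.append((theta, j, b))
        weights.append(w_t)
    return stumps, weights

# Final hypothesis (line 9): h_S(x) = sign(sum_t w_t h_t(x)), e.g. via boosted_predict above.
```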


SLIDE 32

Example

[Figure: AdaBoost rounds (a) t = 1, (b) t = 2, (c) t = 3]

[Mohri et al., 2018, Page 147]

SLIDE 33

Example (Cont.)

The final combined hypothesis is

$$h(x) = \mathrm{sign}\Big(\sum_{t=1}^{T} w_t h_t(x)\Big) \tag{12}$$

[Mohri et al., 2018, Page 147]


SLIDE 35

Theoretical Analysis

Let S be a training set, and assume that at each iteration of AdaBoost the weak learner returns a hypothesis for which ε_t ≤ 1/2 − γ. Then the training error of the output hypothesis of AdaBoost is at most

$$L_S(h_S) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}[h_S(x_i) \neq y_i] \le \exp(-2\gamma^2 T) \tag{13}$$

[Shalev-Shwartz and Ben-David, 2014, Pages 135–137]
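To read the bound: driving the training error below a target ε only requires a number of rounds T logarithmic in 1/ε. A short derivation (my own, not on the slides):

```latex
% Solving exp(-2 gamma^2 T) <= epsilon for T (illustrative, not from the slides).
\exp(-2\gamma^2 T) \le \epsilon
\quad\Longleftrightarrow\quad
T \ge \frac{\ln(1/\epsilon)}{2\gamma^2}.
% Example: with gamma = 0.1 and epsilon = 0.01, T >= ln(100)/0.02 ~ 230.3, so T = 231 rounds suffice.
```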


SLIDE 37

VC Dimension

Let

◮ B be a base hypothesis space (e.g., decision stumps)
◮ L(B, T) be the hypothesis space of classifiers produced by running AdaBoost for T rounds over B

Assume that both T and VC-dim(B) are at least 3. Then

$$\text{VC-dim}(L(B, T)) \le O\big(T \cdot \text{VC-dim}(B) \cdot \log(T \cdot \text{VC-dim}(B))\big)$$

[Shalev-Shwartz and Ben-David, 2014, Page 139]

SLIDE 38

Reference

Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of Machine Learning. MIT Press.

Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.