

SLIDE 1

Decision Trees: Some exercises

0.

SLIDE 2

Exemplifying how to compute information gains and how to work with decision stumps

CMU, 2013 fall, W. Cohen E. Xing, Sample questions, pr. 4

1.

SLIDE 3

Timmy wants to know how to do well on the ML exam. He collects old statistics and decides to use decision trees to build his model. He has 9 data points and two binary features: "whether he stayed up late before the exam" (S) and "whether he attended all the classes" (A). We already know the following statistics:

Set(all) = [5+, 4−]
Set(S+) = [3+, 2−], Set(S−) = [2+, 2−]
Set(A+) = [5+, 1−], Set(A−) = [0+, 3−]

Suppose we split first on the feature that gains the most information. Which feature should we choose, and how much is the information gain? You may use the following approximations:

N       3     5     7
log2 N  1.58  2.32  2.81

2.

SLIDE 4

[Figure: the two decision stumps on [5+, 4−] — S splits it into [3+, 2−] (S+) and [2+, 2−] (S−, with H = 1); A splits it into [5+, 1−] (A+) and [0+, 3−] (A−, with H = 0).]

H(all) = H[5+, 4−] = H(5/9)
 = (5/9) · log2(9/5) + (4/9) · log2(9/4)
 = log2 9 − (5/9) · log2 5 − (4/9) · log2 4
 = 2 log2 3 − (5/9) · log2 5 − 8/9
 ≈ 0.991076

H(all | S) = (5/9) · H[3+, 2−] + (4/9) · H[2+, 2−]
 = (5/9) · 0.970951 + (4/9) · 1 = 0.983861

H(all | A) = (6/9) · H[5+, 1−] + (3/9) · H[0+, 3−]
 = (6/9) · 0.650022 + (3/9) · 0 = 0.433348

IG(all, S) = H(all) − H(all | S) = 0.007215
IG(all, A) = H(all) − H(all | A) = 0.557728

IG(all, S) < IG(all, A) ⇔ H(all | S) > H(all | A), so we choose the attribute A.

3.
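These information gains are easy to check numerically. The sketch below is ours (the helper names `entropy` and `info_gain` are not part of the problem statement):

```python
import math

def entropy(pos, neg):
    """Entropy (in bits) of a binary partition with `pos` and `neg` counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            h -= p * math.log2(p)
    return h

def info_gain(parent, branches):
    """IG = H(parent) - weighted sum of branch entropies.
    `parent` and each branch are (pos, neg) count pairs."""
    n = sum(parent)
    h_cond = sum((p + q) / n * entropy(p, q) for p, q in branches)
    return entropy(*parent) - h_cond

# Statistics from the problem: Set(all) = [5+,4-];
# S splits it into [3+,2-] / [2+,2-], A into [5+,1-] / [0+,3-].
ig_s = info_gain((5, 4), [(3, 2), (2, 2)])
ig_a = info_gain((5, 4), [(5, 1), (0, 3)])
print(round(ig_s, 6))  # ~0.007215
print(round(ig_a, 6))  # ~0.557728
```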

SLIDE 5

Decision stumps; entropy, mean conditional entropy, and information gain: some very convenient formulas to be used when working with pocket calculators

Sebastian Ciobanu, Liviu Ciortuz, 2017

4.

SLIDE 6

Consider the decision stump given in the nearby image. The symbols a, b, c, d, e and f represent counts computed from a training dataset (not provided). As you can see, the label (or output variable), here denoted Y, is binary, and so is the attribute (or input variable) A. Obviously, a = c + e and b = d + f.

[Figure: a decision stump testing A on the partition [a+, b−], with children [c+, d−] and [e+, f−].]

a. Prove that the entropy of [the output variable] corresponding to the partition associated to the test node in this decision stump is

H[a+, b−] = 1/(a+b) · log2( (a+b)^(a+b) / (a^a · b^b) ),  if a ≠ 0 and b ≠ 0.

b. Derive a similar formula for the entropy for the case when the output variable has three values, and the partition associated to the test node in the decision stump would be [a+, b−, c∗]. (Note that there is no link between the last c and the c count in the above decision stump.)

5.

SLIDE 7
c. Assume that for the above given decision stump we would have [all counts] c, d, e and f different from 0. Prove that the mean conditional entropy corresponding to this decision stump is

H_node|attribute = 1/(a+b) · log2( (c+d)^(c+d) / (c^c · d^d) · (e+f)^(e+f) / (e^e · f^f) ).

d. Now suppose that one of the counts c, d, e and f is 0; for example, let's consider c = 0. Infer the formula for the mean conditional entropy in this case.

e. Prove the following formula for the information gain corresponding to the above given decision stump, assuming that a, b, c, d, e and f are all strictly positive:

IG_node;attribute = 1/(a+b) · log2( (a+b)^(a+b) / (a^a · b^b) · (c^c · d^d) / (c+d)^(c+d) · (e^e · f^f) / (e+f)^(e+f) ).

6.

SLIDE 8

WARNING!

A serious problem when using the above formulas on a pocket calculator is that the internal representation capacity for intermediate results can overflow. For example, a Sharp EL-531VH pocket calculator can represent the number 56^56 but not 57^57. Similarly, the calculator made available by the Linux Mint operating system [see the Accessories menu] can represent 179^179 but not 180^180. In such overflow cases, you should use the basic / general formulas for entropies and the information gain, because they make better use of the log function.

7.
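In such cases the expanded log form never creates the huge intermediate powers; a small sketch (the helper name is ours):

```python
import math

def entropy_via_logs(a, b):
    """H[a+, b-] = ((a+b)*log2(a+b) - a*log2(a) - b*log2(b)) / (a+b).
    Expanding log2((a+b)^(a+b) / (a^a * b^b)) this way keeps every
    intermediate result small, so nothing like 180^180 is ever formed."""
    n = a + b
    total = n * math.log2(n)
    for count in (a, b):
        if count > 0:
            total -= count * math.log2(count)
    return total / n

print(round(entropy_via_logs(5, 4), 6))      # H(all) of the previous exercise, ~0.991076
print(round(entropy_via_logs(500, 300), 6))  # counts far beyond a calculator's power range
```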

SLIDE 9

Answer

a. H[a+, b−] = − ( a/(a+b) · log2( a/(a+b) ) + b/(a+b) · log2( b/(a+b) ) )

= − 1/(a+b) · ( a · log2( a/(a+b) ) + b · log2( b/(a+b) ) )

= − 1/(a+b) · ( log2( a/(a+b) )^a + log2( b/(a+b) )^b )

= − 1/(a+b) · ( log2( a^a / (a+b)^a ) + log2( b^b / (a+b)^b ) )

= − 1/(a+b) · log2( (a^a · b^b) / (a+b)^(a+b) )

= 1/(a+b) · log2( (a+b)^(a+b) / (a^a · b^b) ).

b. H[a+, b−, c∗] = − ( a/(a+b+c) · log2( a/(a+b+c) ) + b/(a+b+c) · log2( b/(a+b+c) ) + c/(a+b+c) · log2( c/(a+b+c) ) )

= − 1/(a+b+c) · ( log2( a^a / (a+b+c)^a ) + log2( b^b / (a+b+c)^b ) + log2( c^c / (a+b+c)^c ) )

= − 1/(a+b+c) · log2( (a^a · b^b · c^c) / (a+b+c)^(a+b+c) )

= 1/(a+b+c) · log2( (a+b+c)^(a+b+c) / (a^a · b^b · c^c) ).

8.

SLIDE 10

c. H_node|attribute = (c+d)/(a+b) · H[c+, d−] + (e+f)/(a+b) · H[e+, f−]

= (c+d)/(a+b) · 1/(c+d) · log2( (c+d)^(c+d) / (c^c · d^d) ) + (e+f)/(a+b) · 1/(e+f) · log2( (e+f)^(e+f) / (e^e · f^f) )

= 1/(a+b) · log2( (c+d)^(c+d) / (c^c · d^d) · (e+f)^(e+f) / (e^e · f^f) ).

d. For c = 0 we have H[c+, d−] = H[0+, d−] = 0, so

H_node|attribute = (e+f)/(a+b) · H[e+, f−]

= (e+f)/(a+b) · 1/(e+f) · ( e · log2( (e+f)/e ) + f · log2( (e+f)/f ) )

= 1/(a+b) · log2( (e+f)^(e+f) / (e^e · f^f) ).

e. IG_node;attribute = 1/(a+b) · log2( (a+b)^(a+b) / (a^a · b^b) ) − 1/(a+b) · log2( (c+d)^(c+d) / (c^c · d^d) · (e+f)^(e+f) / (e^e · f^f) )

= 1/(a+b) · log2( (a+b)^(a+b) / (a^a · b^b) · (c^c · d^d) / (c+d)^(c+d) · (e^e · f^f) / (e+f)^(e+f) ).

9.

SLIDE 11

Important Remarks

1. Since most pocket calculators do not have a log2 function but only ln and lg, in the formulas presented or derived at points a−e it is desirable to change the base of the logarithm. Besides replacing log2 with ln or lg, this amounts to multiplying the right-hand side by 1/ln 2, respectively 1/lg 2.

2. Since, when applying the ID3 algorithm, choosing the best attribute for the current node only requires computing the mean conditional entropies, it suffices to compare products of the form

(c+d)^(c+d) / (c^c · d^d) · (e+f)^(e+f) / (e^e · f^f)    (1)

for the decision stumps considered at that node, and to choose the minimum among these products.

10.

SLIDE 12

Exemplifying the application of the ID3 algorithm on a toy mushroom dataset

CMU, 2002(?) spring, Andrew Moore, midterm example questions, pr. 2

11.

SLIDE 13

You are stranded on a deserted island. Mushrooms of various types grow widely all over the island, but no other food is anywhere to be found. Some of the mushrooms have been determined as poisonous and others as not (determined by your former companions' trial and error). You are the only one remaining on the island. You have the following data to consider:

Example  NotHeavy  Smelly  Spotted  Smooth  Edible
A        1         0       0        0       1
B        1         0       1        0       1
C        0         1       0        1       1
D        0         0       0        1       0
E        1         1       1        0       0
F        1         0       1        1       0
G        1         0       0        1       0
H        0         1       0        0       0
U        0         1       1        1       ?
V        1         1       0        1       ?
W        1         1       0        0       ?

You know whether or not mushrooms A through H are poisonous, but you do not know about U through W.

12.

SLIDE 14

For questions a–d, consider only mushrooms A through H.

a. What is the entropy of Edible?

b. Which attribute should you choose as the root of a decision tree? Hint: You can figure this out by looking at the data without explicitly computing the information gain of all four attributes.

c. What is the information gain of the attribute you chose in the previous question?

d. Build an ID3 decision tree to classify mushrooms as poisonous or not.

e. Classify mushrooms U, V and W using the decision tree as poisonous or not poisonous.

f. If the mushrooms A through H that you know are not poisonous suddenly became scarce, should you consider trying U, V and W? Which one(s) and why? Or if none of them, then why not?

13.

SLIDE 15

a. H_Edible = H[3+, 5−]
 = −(3/8) · log2(3/8) − (5/8) · log2(5/8)
 = (3/8) · log2(8/3) + (5/8) · log2(8/5)
 = (3/8) · (3 − log2 3) + (5/8) · (3 − log2 5)
 = 3 − (3/8) · log2 3 − (5/8) · log2 5 ≈ 0.9544

14.

SLIDE 16

b. The four candidate decision stumps at the root (partition [3+, 5−]):

NotHeavy: 0 → [1+, 2−], 1 → [2+, 3−]
Smelly:   0 → [2+, 3−], 1 → [1+, 2−]
Spotted:  0 → [2+, 3−], 1 → [1+, 2−]
Smooth:   0 → [2+, 2−], 1 → [1+, 3−]

NotHeavy, Smelly and Spotted all produce the same pair of partitions, so only Smooth needs to be compared against one of them. (The Smooth = 0 and Smooth = 1 branches become Node 1 and Node 2 below.)

15.

SLIDE 17

c. H_0/Smooth = (4/8) · H[2+, 2−] + (4/8) · H[1+, 3−]
 = 1/2 · 1 + 1/2 · ( (1/4) · log2 4 + (3/4) · log2(4/3) )
 = 1/2 + 1/2 · ( (1/4) · 2 + (3/4) · 2 − (3/4) · log2 3 )
 = 1/2 + 1/2 · ( 2 − (3/4) · log2 3 )
 = 1/2 + 1 − (3/8) · log2 3
 = 3/2 − (3/8) · log2 3 ≈ 0.9056

IG_0/Smooth = H_Edible − H_0/Smooth = 0.9544 − 0.9056 = 0.0488

16.

SLIDE 18

d. H_0/NotHeavy = (3/8) · H[1+, 2−] + (5/8) · H[2+, 3−]
 = (3/8) · ( (1/3) · log2 3 + (2/3) · log2(3/2) ) + (5/8) · ( (2/5) · log2(5/2) + (3/5) · log2(5/3) )
 = (3/8) · ( (1/3) · log2 3 + (2/3) · log2 3 − 2/3 ) + (5/8) · ( (2/5) · log2 5 − 2/5 + (3/5) · log2 5 − (3/5) · log2 3 )
 = (3/8) · ( log2 3 − 2/3 ) + (5/8) · ( log2 5 − (3/5) · log2 3 − 2/5 )
 = (3/8) · log2 3 − 2/8 + (5/8) · log2 5 − (3/8) · log2 3 − 2/8
 = (5/8) · log2 5 − 4/8 ≈ 0.9512

⇒ IG_0/NotHeavy = H_Edible − H_0/NotHeavy = 0.9544 − 0.9512 = 0.0032

IG_0/NotHeavy = IG_0/Smelly = IG_0/Spotted = 0.0032 < IG_0/Smooth = 0.0488
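The two information gains can be double-checked numerically; a minimal sketch (the helper name `H` is ours):

```python
import math

def H(pos, neg):
    # Binary entropy (in bits) of a [pos+, neg-] partition.
    n = pos + neg
    return sum(-c / n * math.log2(c / n) for c in (pos, neg) if c > 0)

h_edible = H(3, 5)
h_smooth = 4/8 * H(2, 2) + 4/8 * H(1, 3)    # Smooth = 0 / Smooth = 1 branches
h_notheavy = 3/8 * H(1, 2) + 5/8 * H(2, 3)  # NotHeavy = 0 / NotHeavy = 1 branches
print(round(h_edible - h_smooth, 4))    # IG for Smooth, ~0.0488
print(round(h_edible - h_notheavy, 4))  # IG for NotHeavy, ~0.0032
```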

17.

SLIDE 19

Important Remark

Instead of actually computing these information gains, in order to determine the "best" attribute it would have been enough to compare the values of the mean conditional entropies H_0/Smooth and H_0/NotHeavy:

IG_0/Smooth > IG_0/NotHeavy ⇔ H_0/Smooth < H_0/NotHeavy
 ⇔ 3/2 − (3/8) · log2 3 < (5/8) · log2 5 − 1/2
 ⇔ 12 − 3 log2 3 < 5 log2 5 − 4
 ⇔ 16 < 5 log2 5 + 3 log2 3
 ⇔ 16 < 11.6096 + 4.7548 (true)

Alternatively, using the formulas from the problem UAIC, 2017 fall, S. Ciobanu, L. Ciortuz, we can proceed with even simpler computations (not only here, but whenever the number of instances is not large):

H_0/Smooth < H_0/NotHeavy
 ⇔ 4^4/(2^2 · 2^2) · 4^4/(1^1 · 3^3) < 3^3/(1^1 · 2^2) · 5^5/(2^2 · 3^3)
 ⇔ 4^8 / 3^3 < 5^5
 ⇔ 2^16 < 3^3 · 5^5
 ⇔ 65536 < 27 · 3125 = 84375 (true)

18.

SLIDE 20

Node 1: Smooth = 0 (partition [2+, 2−], mushrooms A, B, E, H)

Smelly:   0 → [2+, 0−], 1 → [0+, 2−]
NotHeavy: 0 → [0+, 1−], 1 → [2+, 1−]
Spotted:  0 → [1+, 1−], 1 → [1+, 1−]

Smelly separates the classes perfectly, so it is placed in Node 1.

19.

SLIDE 21

Node 2: Smooth = 1 (partition [1+, 3−], mushrooms C, D, F, G)

NotHeavy: 0 → [1+, 1−], 1 → [0+, 2−]
Smelly:   0 → [0+, 3−], 1 → [1+, 0−]
Spotted:  0 → [1+, 2−], 1 → [0+, 1−]

Again Smelly separates the classes perfectly, so it is placed in Node 2.

20.

SLIDE 22

The resulting ID3 tree

[Figure: root tests Smooth on [3+, 5−]; the Smooth = 0 branch ([2+, 2−]) tests Smelly, with leaves [2+, 0−] (Smelly = 0) and [0+, 2−] (Smelly = 1); the Smooth = 1 branch ([1+, 3−]) tests Smelly, with leaves [1+, 0−] (Smelly = 1) and [0+, 3−] (Smelly = 0).]

IF (Smooth = 0 AND Smelly = 0) OR (Smooth = 1 AND Smelly = 1) THEN Edible; ELSE ¬Edible;

Classification of test instances:
U: Smooth = 1, Smelly = 1 ⇒ Edible = 1
V: Smooth = 1, Smelly = 1 ⇒ Edible = 1
W: Smooth = 0, Smelly = 1 ⇒ Edible = 0

21.
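The extracted rule can be stated as a one-liner; a small sketch (the function name is ours):

```python
def edible(smooth, smelly):
    """Decision rule read off the ID3 tree: edible iff Smooth and Smelly agree."""
    return smooth == smelly

# Test mushrooms, with their (Smooth, Smelly) values from the table:
for name, smooth, smelly in [("U", 1, 1), ("V", 1, 1), ("W", 0, 1)]:
    print(name, int(edible(smooth, smelly)))
```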

SLIDE 23

Exemplifying the greedy character of the ID3 algorithm

CMU, 2003 fall, T. Mitchell, A. Moore, midterm, pr. 9.a

22.

SLIDE 24

Consider the binary input attributes A, B, C, the output attribute Y, and the following training examples:

A B C Y
0 0 0 0
0 1 1 1
1 0 1 1
1 1 1 0

a. Determine the decision tree computed by the ID3 algorithm. Is this decision tree consistent with the training data?
23.

SLIDE 25

Answer

Node 0 (the root): the candidate decision stumps on [2+, 2−] are

A: 0 → [1+, 1−], 1 → [1+, 1−]
B: 0 → [1+, 1−], 1 → [1+, 1−]
C: 0 → [0+, 1−], 1 → [2+, 1−]

It is immediately apparent that the first two decision stumps have IG = 0, while the third one has IG > 0. Therefore, in node 0 (the root) we place the attribute C.

24.

24.

SLIDE 26

Node 1: We have to classify the instances with C = 1, so the choice is between the attributes A and B.

A: 0 → [1+, 0−], 1 → [1+, 1−]
B: 0 → [1+, 0−], 1 → [1+, 1−]

The two mean conditional entropies are equal: H_1/A = H_1/B = (2/3) · H[1+, 1−] + (1/3) · H[1+, 0−]. Therefore, we may choose either of the two attributes. For definiteness, we choose A.

25.

SLIDE 27

Node 2: At this node the only attribute still available is B, so we place it here. The complete ID3 tree is shown in the nearby figure. By construction, the tree produced by the ID3 algorithm is consistent with the training data whenever the data themselves are consistent (i.e., non-contradictory). In our case, it is immediately verified that the training data are consistent.

[Figure: root tests C on [2+, 2−]; C = 0 → leaf [0+, 1−]; C = 1 → node [2+, 1−] testing A; A = 0 → leaf [1+, 0−]; A = 1 → node [1+, 1−] testing B, with leaves [1+, 0−] (B = 0) and [0+, 1−] (B = 1).]

26.

SLIDE 28
b. Is there a decision tree of smaller depth (than that of the ID3 tree) consistent with the above data? If so, what (logical) concept does this tree represent?

Answer: From the data we can see that the output attribute Y is in fact the logical function A xor B. Representing this function as a decision tree, we obtain the nearby tree. This tree has one level fewer than the tree built with the ID3 algorithm. Therefore, the ID3 tree is not optimal with respect to the number of levels.

[Figure: root tests A on [2+, 2−]; each branch tests B; leaves: A = 0, B = 0 → [0+, 1−]; A = 0, B = 1 → [1+, 0−]; A = 1, B = 0 → [1+, 0−]; A = 1, B = 1 → [0+, 1−].]

27.

SLIDE 29

This is a consequence of the "greedy" character of the ID3 algorithm, which at each iteration chooses the "best" attribute with respect to the information gain criterion. It is well known that greedy algorithms do not guarantee reaching the global optimum.

28.

SLIDE 30

Exemplifying the application of the ID3 algorithm in the presence of both categorical and continuous attributes

CMU, 2012 fall, Eric Xing, Aarti Singh, HW1, pr. 1.1

29.

SLIDE 31

As of September 2012, 800 extrasolar planets have been identified in our galaxy. Super-secret surveying spaceships sent to all these planets have established whether they are habitable for humans or not, but sending a spaceship to each planet is expensive. In this problem, you will come up with decision trees to predict if a planet is habitable based only on features observable using telescopes.

a. In the nearby table you are given the data for all 800 planets surveyed so far. The features observed by telescope are Size ("Big" or "Small") and Orbit ("Near" or "Far"). Each row indicates the values of the features and habitability, and how many times that set of values was observed. So, for example, there were 20 "Big" planets "Near" their star that were habitable.

Size   Orbit  Habitable  Count
Big    Near   Yes        20
Big    Far    Yes        170
Small  Near   Yes        139
Small  Far    Yes        45
Big    Near   No         130
Big    Far    No         30
Small  Near   No         11
Small  Far    No         255

Derive and draw the decision tree learned by ID3 on this data (use the maximum information gain criterion for splits; don't do any pruning). Make sure to clearly mark at each node what attribute you are splitting on, and which value corresponds to which branch. By each leaf node of the tree, write in the number of habitable and uninhabitable planets in the training data that belong to that node.

30.

SLIDE 32

Answer: Level 1

[Figure: the two candidate stumps on [374+, 426−] — Size: Big → [190+, 160−], Small → [184+, 266−]; Orbit: Near → [159+, 141−], Far → [215+, 285−].]

H(Habitable) = H(374/800) ≈ 0.9969

H(Habitable | Size) = (350/800) · H(19/35) + (450/800) · H(92/225)
 = (35/80) · 0.9946 + (45/80) · 0.9759 = 0.9841

H(Habitable | Orbit) = (300/800) · H(47/100) + (500/800) · H(43/100)
 = (3/8) · 0.9974 + (5/8) · 0.9858 = 0.9901

IG(Habitable; Size) = 0.9969 − 0.9841 = 0.0128
IG(Habitable; Orbit) = 0.9969 − 0.9901 = 0.0067

So Size is placed at the root.
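As a numerical cross-check of these values (they match up to the rounding of the intermediate entropies; the helper name `H` is ours):

```python
import math

def H(p):
    # Entropy (in bits) of a Bernoulli variable with parameter p.
    return 0.0 if p in (0, 1) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Aggregated from the table: Big = [190+,160-], Small = [184+,266-],
# Near = [159+,141-], Far = [215+,285-]; overall [374+,426-].
h_root = H(374 / 800)
h_size = 350 / 800 * H(190 / 350) + 450 / 800 * H(184 / 450)
h_orbit = 300 / 800 * H(159 / 300) + 500 / 800 * H(215 / 500)
print(round(h_root - h_size, 4), round(h_root - h_orbit, 4))
```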

31.

SLIDE 33

The final decision tree

[Figure: root tests Size on [374+, 426−]; Big → [190+, 160−], then Orbit: Near → [20+, 130−] (−), Far → [170+, 30−] (+); Small → [184+, 266−], then Orbit: Near → [139+, 11−] (+), Far → [45+, 255−] (−).]

32.

SLIDE 34

b. For just 9 of the planets, a third feature, Temperature (in degrees Kelvin), has been measured, as shown in the nearby table.

Size   Orbit  Temperature  Habitable
Big    Far    205          No
Big    Near   205          No
Big    Near   260          Yes
Big    Near   380          Yes
Small  Far    205          No
Small  Far    260          Yes
Small  Near   260          Yes
Small  Near   380          No
Small  Near   380          No

Redo all the steps from part a on this data, using all three features. For the Temperature feature, in each iteration you must maximize over all possible binary thresholding splits (such as T ≤ 250 vs. T > 250, for example).

According to your decision tree, would a planet with the features (Big, Near, 280) be predicted to be habitable or not habitable?

Hint: You might need to use the following values of the entropy function for a Bernoulli variable of parameter p: H(1/3) = 0.9182, H(2/5) = 0.9709, H(92/225) = 0.9759, H(43/100) = 0.9858, H(16/35) = 0.9946, H(47/100) = 0.9974.

33.

SLIDE 35

Answer

Binary threshold splits for the continuous attribute Temperature: the sorted distinct values are 205, 260 and 380, giving the candidate thresholds 232.5 and 320.

34.
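The construction of these candidate thresholds can be sketched as follows (the helper name is ours):

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct sorted values of a continuous
    attribute -- the standard candidate split points for 'T <= threshold' tests."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]

temps = [205, 205, 260, 380, 205, 260, 260, 380, 380]
print(candidate_thresholds(temps))  # [232.5, 320.0]
```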

SLIDE 36

Answer: Level 1

[Figure: candidate stumps on [4+, 5−] (entropy H(4/9)) — Size: Big → [2+, 2−] (H = 1), Small → [2+, 3−] (H(2/5)); Orbit: Near → [3+, 3−] (H = 1), Far → [1+, 2−] (H(1/3)); T ≤ 232.5: Yes → [0+, 3−] (H = 0), No → [4+, 2−] (H(1/3)); T ≤ 320: Yes → [3+, 3−] (H = 1), No → [1+, 2−] (H(1/3)).]

H(Habitable | Size) = (4/9) · 1 + (5/9) · H(2/5) = 4/9 + (5/9) · 0.9709 = 0.9838
H(Habitable | Temp ≤ 232.5) = (3/9) · 0 + (6/9) · H(1/3) = (2/3) · 0.9182 = 0.6121

IG(Habitable; Size) = H(4/9) − 0.9838 = 0.9911 − 0.9838 = 0.0072
IG(Habitable; Temp ≤ 232.5) = 0.9911 − 0.6121 = 0.3788

The maximum information gain is attained by Temp ≤ 232.5, so this test is placed in the root.

35.

SLIDE 37

Answer: Level 2

[Figure: candidate stumps on the Temp > 232.5 partition [4+, 2−] — Size: Big → [2+, 0−] (H = 0), Small → [2+, 2−] (H = 1); Orbit: Near → [3+, 2−] (H(2/5)), Far → [1+, 0−] (H = 0); T ≤ 320: Yes → [3+, 0−] (H = 0), No → [1+, 2−] (H(1/3)). The best split is T ≤ 320.]

Note: The plain lines in the figure indicate that both the specific conditional entropies and their coefficients (weights) in the mean conditional entropies satisfy the indicated relationship. (For example, H(2/5) > H(1/3) and 5/6 > 3/6.) The dotted lines indicate that only the specific conditional entropies satisfy the indicated relationship. (For example, H[2+, 2−] = 1 > H(2/5), but 4/6 < 5/6.)

36.

SLIDE 38

The final decision tree:

[Figure: root tests Temp ≤ 232.5 on [4+, 5−]; Yes → leaf [0+, 3−] (−); No → [4+, 2−], test Temp ≤ 320; Yes → leaf [3+, 0−] (+); No → [1+, 2−], test Size; Big → leaf [1+, 0−] (+); Small → leaf [0+, 2−] (−).]

c. According to your decision tree, would a planet with the features (Big, Near, 280) be predicted to be habitable or not habitable?

Answer: habitable (since 280 > 232.5 and 280 ≤ 320).

37.

SLIDE 39

Exemplifying the application of the ID3 algorithm on continuous attributes, and in the presence of noise. Decision surfaces; decision boundaries. The computation of the LOOCV error

CMU, 2002 fall, Andrew Moore, midterm, pr. 3

38.

SLIDE 40

Suppose we are learning a classifier with binary output values Y = 0 and Y = 1. There is one real-valued input X. The training data is given in the nearby table.

X    Y
1    0
2    0
3    0
4    0
6    1
7    1
8    1
8.5  0
9    1
10   1

Assume that we learn a decision tree on this data. Assume that when the decision tree splits on the real-valued attribute X, it puts the split threshold halfway between the values that surround the split. For example, using information gain as the splitting criterion, the decision tree would initially choose to split at X = 5, which is halfway between the X = 4 and X = 6 datapoints.

Let Algorithm DT2 be the method of learning a decision tree with only two leaf nodes (i.e., only one split). Let Algorithm DT⋆ be the method of learning a decision tree fully, with no pruning.

a. What will be the training set error for DT2 and, respectively, DT⋆ on our data?

b. What will be the leave-one-out cross-validation (LOOCV) error for DT2 and, respectively, DT⋆ on our data?

39.

SLIDE 41
• Training data: X = 1, 2, 3, 4 and 8.5 are labeled 0; X = 6, 7, 8, 9, 10 are labeled 1.

• Discretization / decision thresholds: 5, 8.25 and 8.75.

• Decision "surfaces" (compact representation of the ID3 tree): class 0 for X < 5, class 1 for 5 < X < 8.25, class 0 for 8.25 < X < 8.75, class 1 for X > 8.75.

• ID3 tree: the root, [5−, 5+], tests X < 5; Yes → leaf [4−, 0+] (predict 0); No → [1−, 5+], test X < 8.25; Yes → leaf [0−, 3+] (predict 1); No → [1−, 2+], test X < 8.75; Yes → leaf [1−, 0+] (predict 0); No → leaf [0−, 2+] (predict 1).

40.

SLIDE 42

ID3: IG computations

Level 0 (partition [5−, 5+]):
X < 5:    Yes → [4−, 0+], No → [1−, 5+]
X < 8.25: Yes → [4−, 3+], No → [1−, 2+]
X < 8.75: Yes → [5−, 3+], No → [0−, 2+]
The winner is X < 5.

Level 1 (partition [1−, 5+]):
X < 8.25: Yes → [0−, 3+], No → [1−, 2+]; IG = 0.191
X < 8.75: Yes → [1−, 3+], No → [0−, 2+]; IG = 0.109
The winner is X < 8.25.

Decision "surfaces": class 0 below 5, class 1 on (5, 8.25), class 0 on (8.25, 8.75), class 1 above 8.75.

41.

SLIDE 43

ID3 (DT⋆), LOOCV: decision surfaces for each left-out point

• X = 1, 2, 3, 7 or 10 left out: the thresholds 5, 8.25 and 8.75 are unchanged; the left-out point is classified correctly.
• X = 4 left out: the first threshold moves to 4.5 (8.25 and 8.75 unchanged); X = 4 is still classified 0: correct.
• X = 6 left out: the first threshold moves to 5.5; X = 6 is still classified 1: correct.
• X = 8 left out: the thresholds become 5, 7.75 and 8.75; X = 8 falls in the (7.75, 8.75) zone and is classified 0: error.
• X = 8.5 left out: the tree reduces to the single threshold 5; X = 8.5 is classified 1: error.
• X = 9 left out: the thresholds become 5, 8.25 and 9.25; X = 9 falls in the (8.25, 9.25) zone and is classified 0: error.

LOOCV error: 3/10.

42.

SLIDE 44

DT2

The single split is X < 5: Yes → leaf [4−, 0+] (predict 0); No → leaf [1−, 5+] (predict 1). Decision "surfaces": class 0 for X < 5, class 1 for X > 5. The training error of DT2 is 1/10 (the point X = 8.5 is misclassified), while DT⋆ has training error 0.

43.

SLIDE 45

DT2, LOOCV: IG computations

Case 1: X = 1, 2, 3 or 4 left out (remaining data [4−, 5+]): the best split is still X < 5 (or X < 4.5 when X = 4 is left out), with branches [3−, 0+] and [1−, 5+]; the left-out point is classified 0: correct.

Case 2: X = 6, 7 or 8 left out (remaining data [5−, 4+]): the best split is still X < 5 (or X < 5.5 when X = 6 is left out; the competing thresholds 7.75 / 8.25 and 8.75 lose), with branches [4−, 0+] and [1−, 4+]; the left-out point is classified 1: correct.

44.

SLIDE 46

DT2, LOOCV: IG computations (cont'd)

Case 3: X = 8.5 left out (remaining data [4−, 5+]): the split is X < 5, with branches [4−, 0+] and [0−, 5+]; X = 8.5 is classified 1: error.

Case 4: X = 9 or 10 left out (remaining data [5−, 4+]): the best split is still X < 5 (the competing thresholds 8.25 and 8.75 / 9.25 lose), with branches [4−, 0+] and [1−, 4+]; the left-out point is classified 1: correct.

LOOCV error: 1/10.

45.

SLIDE 47

Applying ID3 on a dataset with two continuous attributes: decision zones

Liviu Ciortuz, 2017

46.

SLIDE 48

Consider the training dataset in the nearby figure. X1 and X2 are considered continuous attributes. Apply the ID3 algorithm on this dataset. Draw the resulting decision tree. Make a graphical representation of the decision areas and decision boundaries determined by ID3.

[Figure: nine training points in the (X1, X2) plane, with X1 ∈ {1, ..., 5} and X2 ∈ {1, ..., 4}; four are positive and five negative.]

47.

SLIDE 49

Solution

Level 1:

[Figure: the candidate stumps for the root (partition [4+, 5−]), over the tests X1 < 5/2, X1 < 9/2, X2 < 3/2, X2 < 5/2 and X2 < 7/2.]

The mean conditional entropies written out in the figure are:

H[Y | ·] = (5/9) · H(2/5) + (4/9) · H(1/4), for the split into [3+, 2−] and [1+, 3−] (IG = 0.091);
H[Y | ·] = (7/9) · H(2/7), for the split into [2+, 5−] and [2+, 0−] (IG = 0.319);
H[Y | ·] = (2/3) · H(1/3), for X2 < 7/2, which splits [4+, 5−] into [4+, 2−] (No) and [0+, 3−] (Yes) (IG = 0.378).

The maximum information gain is attained by X2 < 7/2, so this test is placed in the root.

48.

SLIDE 50

Level 2: the node for X2 ≥ 7/2 (partition [4+, 2−]). The candidate splits are:

X2 < 3/2: [1+, 1−] and [3+, 1−]; H[Y | ·] = 1/3 + (2/3) · H(1/4); IG = 0.04
X2 < 5/2: [3+, 2−] and [1+, 0−]; H[Y | ·] = (5/6) · H(2/5); IG = 0.109
X1 < 5/2: Yes → [2+, 0−], No → [2+, 2−]; H[Y | ·] = 2/3; IG = 0.251
X1 < 4:   Yes → [2+, 2−], No → [2+, 0−]; H[Y | ·] = 2/3; IG = 0.251

Notes:

1. Split thresholds for continuous attributes must be recomputed at each new iteration, because they may change. (For instance, here above, 4 replaces 4.5 as a threshold for X1.)

2. In the current stage, i.e., for the current node in the ID3 tree, you may choose (as test) either X1 < 5/2 or X1 < 4.

3. Here above we have an example of a reversed relationship between the weighted and, respectively, un-weighted specific entropies: H[2+, 2−] > H[3+, 2−], but (4/6) · H[2+, 2−] < (5/6) · H[3+, 2−].

49.

49.

SLIDE 51

The final decision tree:

[Figure: root tests X2 < 7/2 on [4+, 5−]; Yes → leaf [0+, 3−] (−); No → [4+, 2−], test X1 < 5/2; Yes → leaf [2+, 0−] (+); No → [2+, 2−], test X1 < 4; Yes → leaf [0+, 2−] (−); No → leaf [2+, 0−] (+).]

Decision areas: the plane is cut by the lines X2 = 7/2, X1 = 5/2 and X1 = 4 into a negative area (X2 < 7/2), a positive area (X2 ≥ 7/2, X1 < 5/2), a negative area (X2 ≥ 7/2, 5/2 ≤ X1 < 4), and a positive area (X2 ≥ 7/2, X1 ≥ 4).

50.

SLIDE 52

Other criteria than IG for the best attribute selection in ID3:

Gini impurity / index and Misclassification impurity

CMU, 2003 fall, T. Mitchell, A. Moore, HW1, pr. 4

51.

SLIDE 53

Entropy is a natural measure to quantify the impurity of a data set. The decision tree learning algorithm uses entropy as a splitting criterion, calculating the information gain to decide the next attribute with which to partition the current node. However, there are other impurity measures that could be used as splitting criteria too. Let's investigate two of them. Assume the current node n has k classes c1, c2, . . . , ck.

Gini impurity: i(n) = 1 − Σ_{i=1}^{k} P²(ci).

Misclassification impurity: i(n) = 1 − max_{i=1}^{k} P(ci).

a. Assume node n has two classes, c1 and c2. Please draw a figure in which the three impurity measures (Entropy, Gini and Misclassification) are represented as functions of P(c1).

52.

SLIDE 54

Answer

Entropy(p) = −p · log2 p − (1 − p) · log2(1 − p)

Gini(p) = 1 − p² − (1 − p)² = 2p(1 − p)

MisClassif(p) = 1 − max(p, 1 − p) = { p, for p ∈ [0, 1/2); 1 − p, for p ∈ [1/2, 1] }

[Figure: the three curves plotted over p ∈ [0, 1]; all vanish at p = 0 and p = 1 and peak at p = 1/2, where Entropy = 1 and Gini = MisClassif = 1/2.]
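For two classes, the three measures can be written as functions of p = P(c1); a small sketch (the function names are ours):

```python
import math

def entropy_imp(p):
    # Entropy impurity of a binary node with class-1 probability p.
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini_imp(p):
    # Gini impurity: 1 - p^2 - (1-p)^2 = 2p(1-p).
    return 2 * p * (1 - p)

def misclassif_imp(p):
    # Misclassification impurity: 1 - max(p, 1-p) = min(p, 1-p).
    return min(p, 1 - p)

for p in (0.0, 0.25, 0.5, 1.0):
    print(p, round(entropy_imp(p), 4), round(gini_imp(p), 4), misclassif_imp(p))
```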

53.

SLIDE 55
b. We can now define new splitting criteria based on the Gini and Misclassification impurities, called Drop-of-Impurity in some of the literature: the difference between the impurity of the current node and the weighted sum of the impurities of its children. For binary splits, the Drop-of-Impurity is defined as

∆i(n) = i(n) − P(nl) · i(nl) − P(nr) · i(nr),

where nl and nr are, respectively, the left and right child of node n after splitting. Please calculate the Drop-of-Impurity (using both the Gini and the Misclassification impurity) for the following example data set, in which C is the class variable to be predicted.

A  a1  a1  a1  a2  a2  a2
C  c1  c1  c2  c2  c2  c2

54.

SLIDE 56

Answer

[Figure: root [2+, 4−] tests A; a1 → [2+, 1−], a2 → [0+, 3−].]

Gini: p = 2/6 = 1/3 ⇒
 i(0) = 2 · (1/3) · (1 − 1/3) = (2/3) · (2/3) = 4/9
 i(1) = 2 · (2/3) · (1 − 2/3) = (4/3) · (1/3) = 4/9
 i(2) = 0
⇒ ∆i(0) = 4/9 − (3/6) · (4/9) − (3/6) · 0 = 4/9 − 2/9 = 2/9.

Misclassification: p = 1/3 < 1/2 ⇒
 i(0) = p = 1/3
 i(1) = 1 − 2/3 = 1/3
 i(2) = 0
⇒ ∆i(0) = 1/3 − (1/2) · (1/3) = 1/6.

55.
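The two drops of impurity can be checked numerically; a minimal sketch (the helper names are ours):

```python
def gini(counts):
    # Gini impurity of a node given its per-class counts.
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def misclassif(counts):
    # Misclassification impurity of a node given its per-class counts.
    n = sum(counts)
    return 1 - max(counts) / n

def drop(parent, left, right, impurity):
    # Drop-of-Impurity: i(n) - P(nl) i(nl) - P(nr) i(nr).
    n = sum(parent)
    return (impurity(parent) - sum(left) / n * impurity(left)
            - sum(right) / n * impurity(right))

# Dataset from the slide: root [2 c1, 4 c2]; A = a1 -> [2, 1], A = a2 -> [0, 3].
print(drop((2, 4), (2, 1), (0, 3), gini))        # 2/9 ~ 0.2222
print(drop((2, 4), (2, 1), (0, 3), misclassif))  # 1/6 ~ 0.1667
```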

SLIDE 57
c. We choose the attribute that maximizes the Drop-of-Impurity to split a node. Please create a data set and show that, on this data set, the Misclassification-impurity-based ∆i(n) cannot determine which attribute should be used for splitting (e.g., ∆i(n) = 0 for all the attributes), while Information Gain and the Gini-impurity-based ∆i(n) can.

Answer

A  a1  a1  a1  a2  a2  a2  a2
C  c1  c2  c2  c2  c2  c2  c1

Entropy: ∆i(0) = H[5+, 2−] − ( (3/7) · H[2+, 1−] + (4/7) · H[3+, 1−] ) = 0.006 ≠ 0;

Gini: ∆i(0) = 2 · (2/7) · (1 − 2/7) − ( (3/7) · 2 · (1/3) · (1 − 1/3) + (4/7) · 2 · (1/4) · (1 − 1/4) )
 = 2 · ( 10/49 − (2/21 + 3/28) ) = 2 · ( 10/49 − 17/84 ) ≠ 0;

Misclassification: ∆i(0) = 2/7 − ( (3/7) · (1/3) + (4/7) · (1/4) ) = 0.

56.

56.

SLIDE 58

Note: A [quite bad] property

If C1 < C2, C1^l < C2^l and C1^r < C2^r (with C1 = C1^l + C1^r and C2 = C2^l + C2^r), then the Drop-of-Impurity based on the Misclassification impurity is 0.

[Figure: root [C1+, C2−] tests A; a1 → [C1^l +, C2^l −], a2 → [C1^r +, C2^r −].]

Proof

∆i(n) = C1/(C1 + C2) − (C1^l + C2^l)/(C1 + C2) · C1^l/(C1^l + C2^l) − (C1^r + C2^r)/(C1 + C2) · C1^r/(C1^r + C2^r)

= C1/(C1 + C2) − (C1^l + C1^r)/(C1 + C2)

= C1/(C1 + C2) − C1/(C1 + C2) = 0.

57.

SLIDE 59

Exemplifying pre- and post-pruning of decision trees using a threshold for the Information Gain

CMU, 2006 spring, Carlos Guestrin, midterm, pr. 4 [adapted by Liviu Ciortuz]

58.

SLIDE 60

Starting from the data in the following table, the ID3 algorithm builds the decision tree shown nearby.

[Table: five training examples over the binary attributes V, W, X and the output Y; the class partition is [3+; 2−].]

[Figure: the ID3 tree — X at the root; one branch is a leaf predicting 1, the other tests V, whose two children each test W.]

a. One idea for pruning such a decision tree would be to start at the root, and prune splits for which the information gain (or some other criterion) is less than some small ε. This is called top-down pruning. What is the decision tree returned for ε = 0.0001? What is the training set error for this tree?

59.

slide-61
SLIDE 61

Answer

We will first augment the given decision tree with information about the data partitions (i.e., the number of positive and, respectively, negative instances) assigned to each test node during the application of the ID3 algorithm.

[Figure: root tests X on [3+; 2−]; one branch leads to V on [2+; 2−], whose children each test W on [1+; 1−], with leaves [1+; 0−], [0+; 1−], [1+; 0−], [0+; 1−]; the other branch is the leaf [1+; 0−].]

The information gain yielded by the attribute X in the root node is: H[3+; 2−] − 1/5 · 0 − 4/5 · 1 = 0.971 − 0.8 = 0.171 > ε. Therefore, this node will not be eliminated from the tree.

The information gain for the attribute V (in the left-hand child of the root node) is: H[2+; 2−] − 1/2 · 1 − 1/2 · 1 = 1 − 1 = 0 < ε. So the whole left subtree will be cut off and replaced by a decision node, as shown nearby. The training error produced by this tree is 2/5.

[Figure: the pruned tree — only the root test X remains, with a leaf predicting 1 on each branch.]

60.

SLIDE 62

b. Another option would be to start at the leaves, and prune subtrees for which the information gain (or some other criterion) of a split is less than some small ε. In this method, no ancestors of children with high information gain will get pruned. This is called bottom-up pruning. What is the tree returned for ε = 0.0001? What is the training set error for this tree?

Answer:

The information gain of V is IG(Y; V) = 0. A step later, the information gain of W (for either one of the descendant nodes of V) is IG(Y; W) = 1. So bottom-up pruning won't delete any nodes, and the tree [given in the problem statement] remains unchanged. The training error is 0.

61.

SLIDE 63

c. Discuss when you would want to choose bottom-up pruning over top-down pruning, and vice versa.

Answer:

Top-down pruning is computationally cheaper. When building the tree we can determine when to stop (no need for real pruning). But, as we saw, top-down pruning prunes too much. On the other hand, bottom-up pruning is more expensive, since we have to first build a full tree — which can be exponentially large — and only then apply pruning. A second problem with bottom-up pruning is that superfluous attributes may fool it (see CMU, 2009 fall, Carlos Guestrin, HW1, pr. 2.4). A third problem is that, in the lower levels of the tree, the number of examples in the subtree gets smaller, so information gain might be an inappropriate criterion for pruning; one would usually use a statistical test instead.

62.

SLIDE 64

Exemplifying χ2-Based Pruning of Decision Trees

CMU, 2010 fall, Ziv Bar-Joseph, HW2, pr. 2.1

63.

SLIDE 65

In class, we learned a decision tree pruning algorithm that iteratively visited subtrees and used a validation dataset to decide whether to remove the subtree. However, sometimes it is desirable to prune the tree after training on all of the available data. One such approach is based on statistical hypothesis testing. After learning the tree, we visit each internal node and test whether the attribute split at that node is actually uncorrelated with the class labels. We hypothesize that the attribute is independent and then use Pearson's chi-square test to generate a test statistic that may provide evidence that we should reject this "null" hypothesis. If we fail to reject the hypothesis, we prune the subtree at that node.

64.

SLIDE 66
a. At each internal node we can create a contingency table for the training examples that pass through that node on their paths to the leaves. The table will have the c class labels associated with the columns and the r values of the split attribute associated with the rows. Each entry Oi,j in the table is the number of times we observe a training sample with that attribute value and label, where i is the row index that corresponds to an attribute value and j is the column index that corresponds to a class label.

In order to calculate the chi-square test statistic, we need a similar table of expected counts. The expected count is the number of observations we would expect if the class and attribute are independent. Derive a formula for each expected count Ei,j in the table.

Hint: What is the probability that a training example that passes through the node has a particular label? Using this probability and the independence assumption, what can you say about how many examples with a specific attribute value are expected to also have the class label?

65.

SLIDE 67
  • b. Given these two tables for the split, you can now calculate the chi-square test statistic

χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (O_{i,j} − E_{i,j})² / E_{i,j}

with (r − 1)(c − 1) degrees of freedom. You can plug the test statistic and degrees of freedom into a software package^a or an online calculator^b to calculate a p-value. Typically, if p < 0.05 we reject the null hypothesis that the attribute and class are independent and say the split is statistically significant.

The decision tree given on the next slide was built from the data in the table nearby. For each of the 3 internal nodes in the decision tree, show the p-value for the split and state whether it is statistically significant. How many internal nodes will the tree have if we prune splits with p ≥ 0.05?

^a Use 1-chi2cdf(x,df) in MATLAB or CHIDIST(x,df) in Excel.
^b https://en.m.wikipedia.org/wiki/Chi-square distribution#Table of .CF.872 value vs p-value

66.

slide-68
SLIDE 68

Input:

[Table: 12 training examples over the binary features X1, X2, X3, X4 and a binary Class label; the individual rows are not recoverable from this transcript.]

[Figure: the decision tree learned from this data. The root tests X4 over [4−, 8+]; the X4 = 1 branch is the leaf [0−, 6+]; the X4 = 0 branch tests X1 over [4−, 2+], with the X1 = 0 leaf [3−, 0+] and, on the X1 = 1 branch ([1−, 2+]), a further test on X2 with leaves [1−, 0+] and [0−, 2+].]

67.

slide-69
SLIDE 69

Idea

While traversing the ID3 tree [usually in a bottom-up manner], remove the nodes for which there is not enough ("significant") statistical evidence, supported by the set of instances assigned to that node, that there is a dependence between the values of the input attribute tested in that node and the values of the output attribute (the labels).

68.

slide-70
SLIDE 70

Contingency tables

O_X4          Class = 0   Class = 1
X4 = 0            4           2
X4 = 1            0           6          (N = 12)

⇒ P(X4 = 0) = 6/12 = 1/2, P(X4 = 1) = 1/2,
  P(Class = 0) = 4/12 = 1/3, P(Class = 1) = 2/3

O_X1|X4=0     Class = 0   Class = 1
X1 = 0            3           0
X1 = 1            1           2          (N = 6)

⇒ P(X1 = 0 | X4 = 0) = 3/6 = 1/2, P(X1 = 1 | X4 = 0) = 1/2,
  P(Class = 0 | X4 = 0) = 4/6 = 2/3, P(Class = 1 | X4 = 0) = 1/3

O_X2|X4=0,X1=1   Class = 0   Class = 1
X2 = 0               0           2
X2 = 1               1           0       (N = 3)

⇒ P(X2 = 0 | X4 = 0, X1 = 1) = 2/3, P(X2 = 1 | X4 = 0, X1 = 1) = 1/3,
  P(Class = 0 | X4 = 0, X1 = 1) = 1/3, P(Class = 1 | X4 = 0, X1 = 1) = 2/3

69.

slide-71
SLIDE 71

The reasoning that leads to the computation of the expected number of observations:

P(A = i, C = j) = P(A = i) · P(C = j)   (independence)

P(A = i) = (Σ_{k=1}^{c} O_{i,k}) / N   and   P(C = j) = (Σ_{k=1}^{r} O_{k,j}) / N

⇒ P(A = i, C = j) = (Σ_{k=1}^{c} O_{i,k}) · (Σ_{k=1}^{r} O_{k,j}) / N²

E[A = i, C = j] = N · P(A = i, C = j) = (Σ_{k=1}^{c} O_{i,k}) · (Σ_{k=1}^{r} O_{k,j}) / N

70.

slide-72
SLIDE 72

Expected number of observations

E_X4          Class = 0   Class = 1
X4 = 0            2           4
X4 = 1            2           4

E_X1|X4=0     Class = 0   Class = 1
X1 = 0            2           1
X1 = 1            2           1

E_X2|X4=0,X1=1   Class = 0   Class = 1
X2 = 0              2/3         4/3
X2 = 1              1/3         2/3

Example, E_X4(X4 = 0, Class = 0): N = 12, P(X4 = 0) = 1/2 and P(Class = 0) = 1/3, so
N · P(X4 = 0, Class = 0) = N · P(X4 = 0) · P(Class = 0) = 12 · (1/2) · (1/3) = 2. 71.

slide-73
SLIDE 73

χ² statistics

χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (O_{i,j} − E_{i,j})² / E_{i,j}

χ²_X4 = (4 − 2)²/2 + (0 − 2)²/2 + (2 − 4)²/4 + (6 − 4)²/4 = 2 + 2 + 1 + 1 = 6

χ²_X1|X4=0 = (3 − 2)²/2 + (1 − 2)²/2 + (0 − 1)²/1 + (2 − 1)²/1 = 3

χ²_X2|X4=0,X1=1 = (0 − 2/3)²/(2/3) + (1 − 1/3)²/(1/3) + (2 − 4/3)²/(4/3) + (0 − 2/3)²/(2/3) = (4/9) · (27/4) = 3

Each test has (r − 1)(c − 1) = 1 degree of freedom, so the p-values are 0.0143, 0.0833 and 0.0833, respectively. Only the first of these p-values is smaller than 0.05, therefore the root node (X4) is the only node that cannot be pruned; after pruning the splits with p ≥ 0.05, the tree has a single internal node.

72.
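The whole pruning test can be sketched in a few lines of stdlib Python. The function names are my own; for one degree of freedom, the p-value of the χ² test can be computed without SciPy, since P(χ²₁ > x) = erfc(√(x/2)):

```python
import math

def expected_counts(O):
    """E_{i,j} = (row_i total) * (col_j total) / N, under independence."""
    N = sum(map(sum, O))
    rows = [sum(r) for r in O]
    cols = [sum(c) for c in zip(*O)]
    return [[ri * cj / N for cj in cols] for ri in rows]

def chi_square(O):
    """Pearson's test statistic: sum_{i,j} (O_{i,j} - E_{i,j})^2 / E_{i,j}."""
    E = expected_counts(O)
    return sum((o - e) ** 2 / e
               for Orow, Erow in zip(O, E) for o, e in zip(Orow, Erow))

def p_value_df1(x):
    """Survival function of chi-square with 1 degree of freedom:
    P(X > x) = erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2))

# contingency table for the root split on X4 (slide 70)
O_X4 = [[4, 2], [0, 6]]
print(chi_square(O_X4))                           # 6.0
print(round(p_value_df1(chi_square(O_X4)), 4))    # 0.0143
```

Running the same two functions on the X1 and X2 tables from slide 70 reproduces the other two p-values (0.0833).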

slide-74
SLIDE 74

[Plot: the p-value, i.e. one minus Pearson's cumulative χ² distribution, as a function of the test statistic, for k = 1, 2, 3, 4, 6 and 9 degrees of freedom.]

73.

slide-75
SLIDE 75

Output (pruned tree) for the 95% confidence level

[Figure: the pruned tree, consisting of the root test on X4 alone; its two branches become leaves.]

74.

slide-76
SLIDE 76

The AdaBoost algorithm: why was it designed the way it was designed, and the convergence of the training error, in certain conditions

CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5 CMU, 2009 fall, Carlos Guestrin, HW2, pr. 3.1 CMU, 2009 fall, Eric Xing, HW3, pr. 4.2.2

75.

slide-77
SLIDE 77

Consider m training examples S = {(x1, y1), . . . , (xm, ym)}, where x ∈ X and y ∈ {−1, 1}. Suppose we have a weak learning algorithm A which produces a hypothesis h : X → {−1, 1} given any distribution D of examples.

AdaBoost is an iterative algorithm which works as follows:

  • Begin with a uniform distribution D_1(i) = 1/m, i = 1, . . . , m.
  • At each iteration t = 1, . . . , T:
    • run the weak learning algorithm A on the distribution D_t and produce the hypothesis h_t;

      Note (1): Since A is a weak learning algorithm, the produced hypothesis h_t at round t is only slightly better than random guessing, say, by a margin γ_t:
      ε_t = err_{D_t}(h_t) = Pr_{x∼D_t}[y ≠ h_t(x)] = 1/2 − γ_t.

      Note (2): If at a certain iteration t ≤ T the weak classifier A cannot produce a hypothesis better than random guessing (i.e., γ_t = 0), or it produces a hypothesis for which ε_t = 0, then the AdaBoost algorithm should be stopped.

    • update the distribution
      D_{t+1}(i) = (1/Z_t) · D_t(i) · e^{−α_t y_i h_t(x_i)} for i = 1, . . . , m,   (2)
      where α_t := (1/2) ln((1 − ε_t)/ε_t), and Z_t is the normalizer.

  • In the end, deliver H_T = sign(Σ_{t=1}^{T} α_t h_t) as the learned hypothesis, which will act as a weighted majority vote.

76.
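The pseudo-code above translates almost line by line into Python. A minimal sketch (the function names, the stump learner, and the toy 1-D dataset are my own assumptions, not part of the exercise):

```python
import math

def adaboost(X, y, weak_learner, T):
    """Minimal sketch of the pseudo-code above. X: instances, y: labels in
    {-1, +1}; weak_learner(X, y, D) returns h with h(x) in {-1, +1}."""
    m = len(X)
    D = [1.0 / m] * m                      # uniform initial distribution D_1
    hyps, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)
        eps = sum(D[i] for i in range(m) if h(X[i]) != y[i])
        if eps == 0 or eps >= 0.5:         # Note (2): stop the algorithm
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        # (2): D_{t+1}(i) = D_t(i) * exp(-alpha_t y_i h_t(x_i)) / Z_t
        D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(m)]
        Z = sum(D)                         # Z_t is the normalizer
        D = [d / Z for d in D]
        hyps.append(h)
        alphas.append(alpha)
    # H_T = sign(sum_t alpha_t h_t), a weighted majority vote
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, hyps)) >= 0 else -1

def best_stump(X, y, D):
    """Weak learner: the 1-D threshold stump sign(+-(x - s)) with the
    smallest weighted training error under D."""
    best = None
    for s in sorted(set(X) | {min(X) - 1}):    # includes an 'outside' threshold
        for sgn in (1, -1):
            h = lambda x, s=s, sgn=sgn: sgn if x > s else -sgn
            err = sum(d for d, xi, yi in zip(D, X, y) if h(xi) != yi)
            if best is None or err < best[0]:
                best = (err, h)
    return best[1]

X = [1, 2, 3, 4, 5, 6]
y = [-1, -1, 1, 1, -1, -1]        # not separable by any single stump
H = adaboost(X, y, best_stump, T=3)
print([H(x) for x in X])          # [-1, -1, 1, 1, -1, -1]: training error 0
```

Even though no single stump separates this data, three boosted stumps already vote it perfectly, which is exactly the behaviour the convergence proof below explains.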

slide-78
SLIDE 78

We will prove that the training error err_S(H_T) of AdaBoost decreases at a very fast rate, and in certain cases it converges to 0.

Important Remark

The above formulation of the AdaBoost algorithm states no restriction on the h_t hypothesis delivered by the weak classifier A at iteration t, except that ε_t < 1/2. However, in another formulation of the AdaBoost algorithm (in a more general setup; see for instance MIT, 2006 fall, Tommi Jaakkola, HW4, problem 3), it is requested / recommended that the hypothesis h_t be chosen by (approximately) minimizing the weighted training error over a whole class of hypotheses like, for instance, decision trees of depth 1 (decision stumps).

In this problem we will not be concerned with such a request, but we will comply with it for instance in problem CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.6, when showing how AdaBoost works in practice. 77.

slide-79
SLIDE 79

a. Prove the following relationships:

  • i. Z_t = e^{−α_t} · (1 − ε_t) + e^{α_t} · ε_t (a consequence of (2))
  • ii. Z_t = 2 · √(ε_t (1 − ε_t)) (a consequence of i. and the value stated for α_t in the AdaBoost pseudo-code)
  • iii. 0 < Z_t < 1 (a consequence derivable from ii.)
  • iv. D_{t+1}(i) = D_t(i) / (2ε_t) for i ∈ M := {i | y_i ≠ h_t(x_i)}, i.e., the mistake set, and
       D_{t+1}(i) = D_t(i) / (2(1 − ε_t)) for i ∈ C := {i | y_i = h_t(x_i)}, i.e., the correct set
       (a consequence derivable from (2) and ii.)
  • v. ε_i > ε_j ⇒ α_i < α_j (a consequence of the value stated for α_t in the AdaBoost pseudo-code)
  • vi. err_{D_{t+1}}(h_t) = (1/Z_t) · ε_t · e^{α_t}, where err_{D_{t+1}}(h_t) := Pr_{D_{t+1}}({x_i | h_t(x_i) ≠ y_i})
  • vii. err_{D_{t+1}}(h_t) = 1/2 (a consequence derivable from ii. and vi.)

78.

slide-80
SLIDE 80

Solution

a/i. Since Z_t is the normalization factor for the distribution D_{t+1}, we can write:

Z_t = Σ_{i=1}^{m} D_t(i) e^{−α_t y_i h_t(x_i)} = Σ_{i∈C} D_t(i) e^{−α_t} + Σ_{i∈M} D_t(i) e^{α_t} = (1 − ε_t) · e^{−α_t} + ε_t · e^{α_t}.   (3)

a/ii. Since α_t := (1/2) ln((1 − ε_t)/ε_t), it follows that

e^{α_t} = e^{(1/2) ln((1 − ε_t)/ε_t)} = √((1 − ε_t)/ε_t)   (4)

and

e^{−α_t} = 1/e^{α_t} = √(ε_t/(1 − ε_t)).   (5)

So,

Z_t = (1 − ε_t) · √(ε_t/(1 − ε_t)) + ε_t · √((1 − ε_t)/ε_t) = 2 · √(ε_t (1 − ε_t)).   (6)

Note that (1 − ε_t)/ε_t > 1 because ε_t ∈ (0, 1/2); therefore α_t > 0. 79.

slide-81
SLIDE 81

a/iii. The second-order function ε_t(1 − ε_t) reaches its maximum value for ε_t = 1/2, and the maximum is 1/4. Since ε_t ∈ (0, 1/2), it follows from (6) that Z_t > 0 and Z_t < 2 · √(1/4) = 1.

a/iv. Based on (2), we can write D_{t+1}(i) = (1/Z_t) · D_t(i) · e^{α_t} for i ∈ M, and D_{t+1}(i) = (1/Z_t) · D_t(i) · e^{−α_t} for i ∈ C. Therefore, using (4), (5) and (6):

i ∈ M ⇒ D_{t+1}(i) = (1/(2√(ε_t(1 − ε_t)))) · D_t(i) · √((1 − ε_t)/ε_t) = D_t(i) / (2ε_t)

i ∈ C ⇒ D_{t+1}(i) = (1/(2√(ε_t(1 − ε_t)))) · D_t(i) · √(ε_t/(1 − ε_t)) = D_t(i) / (2(1 − ε_t)). 80.

slide-82
SLIDE 82

a/v. Starting from the definition α_t = ln √((1 − ε_t)/ε_t), we can write:

α_i < α_j ⇔ ln √((1 − ε_i)/ε_i) < ln √((1 − ε_j)/ε_j).

Further on, since both the ln and √ functions are strictly increasing, it follows that

α_i < α_j ⇔ (1 − ε_i)/ε_i < (1 − ε_j)/ε_j ⇔ [since ε_i, ε_j > 0] ε_j(1 − ε_i) < ε_i(1 − ε_j) ⇔ ε_j − ε_iε_j < ε_i − ε_iε_j ⇔ ε_i > ε_j.

a/vi. It is easy to see that

err_{D_{t+1}}(h_t) = Σ_{i=1}^{m} D_{t+1}(i) · 1{y_i ≠ h_t(x_i)} = Σ_{i∈M} (1/Z_t) D_t(i) e^{α_t} = (1/Z_t) e^{α_t} Σ_{i∈M} D_t(i) = (1/Z_t) · ε_t · e^{α_t}.   (7)

a/vii. By substituting (6) and (4) into (7), we get:

err_{D_{t+1}}(h_t) = (1/Z_t) · ε_t · e^{α_t} = (1/(2√(ε_t(1 − ε_t)))) · ε_t · √((1 − ε_t)/ε_t) = 1/2. 81.
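The algebra above can be checked numerically. A small sketch (the function and variable names are my own) that verifies identities i., ii., iii., vi. and vii. for several values of ε_t:

```python
import math

# numeric sanity check of i., ii., iii., vi., vii. for several error values,
# with alpha_t = (1/2) ln((1 - eps_t)/eps_t) as in the AdaBoost pseudo-code
for eps in (2/9, 1/7, 1/8, 0.3, 0.49):
    alpha = 0.5 * math.log((1 - eps) / eps)
    Z = (1 - eps) * math.exp(-alpha) + eps * math.exp(alpha)   # i.  (eq. (3))
    assert abs(Z - 2 * math.sqrt(eps * (1 - eps))) < 1e-12     # ii. (eq. (6))
    assert 0 < Z < 1                                           # iii.
    err_next = eps * math.exp(alpha) / Z                       # vi. (eq. (7))
    assert abs(err_next - 0.5) < 1e-12                         # vii.
print("identities i., ii., iii., vi., vii. verified numerically")
```

Note how vii. expresses that, after reweighting, the old hypothesis h_t is exactly as good as random guessing, which forces the next round to find something new.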

slide-83
SLIDE 83
  • b. Show that D_{T+1}(i) = (m · Π_{t=1}^{T} Z_t)^{−1} · e^{−y_i f(x_i)}, where f(x) = Σ_{t=1}^{T} α_t h_t(x).

  • c. Show that err_S(H_T) ≤ Π_{t=1}^{T} Z_t, where err_S(H_T) := (1/m) Σ_{i=1}^{m} 1{H_T(x_i) ≠ y_i} is the training error produced by AdaBoost.

  • d. Obviously, we would like to minimize the test set error produced by AdaBoost, but it is hard to do so directly. We thus settle for greedily optimizing the upper bound on the training error found at part c. Observe that Z_1, . . . , Z_{t−1} are determined by the first t − 1 iterations, and we cannot change them at iteration t. A greedy step we can take to minimize the training set error bound on round t is to minimize Z_t. Prove that the value of α_t that minimizes Z_t (among all possible values for α_t) is indeed α_t = (1/2) ln((1 − ε_t)/ε_t) (see the previous slide).

  • e. Show that Π_{t=1}^{T} Z_t ≤ e^{−2 Σ_{t=1}^{T} γ_t²}.

  • f. From parts c and e, we know the training error decreases at an exponential rate with respect to T. Assume that there is a number γ > 0 such that γ ≤ γ_t for t = 1, . . . , T. (This γ is called a guarantee of empirical γ-weak learnability.) How many rounds are needed to achieve a training error ε > 0? Please express in big-O notation, T = O(·). 82.

slide-84
SLIDE 84

Solution

  • b. We will expand D_t(i) recursively:

D_{T+1}(i) = (1/Z_T) · D_T(i) · e^{−α_T y_i h_T(x_i)}
           = D_{T−1}(i) · (1/Z_{T−1}) e^{−α_{T−1} y_i h_{T−1}(x_i)} · (1/Z_T) e^{−α_T y_i h_T(x_i)}
           = . . .
           = D_1(i) · (1/Π_{t=1}^{T} Z_t) · e^{−Σ_{t=1}^{T} α_t y_i h_t(x_i)}
           = (1/(m · Π_{t=1}^{T} Z_t)) · e^{−y_i f(x_i)}. 83.
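The closed form at part b is pure algebra and holds for arbitrary α_t and arbitrary ±1 hypothesis outputs, so it can be checked against the iterative update on random data. A minimal sketch (all names are my own; the h values here are random signs, not real hypotheses):

```python
import math
import random

random.seed(0)
m, T = 5, 4
y = [random.choice([-1, 1]) for _ in range(m)]
# arbitrary weak-hypothesis outputs h_t(x_i) and weights alpha_t
h = [[random.choice([-1, 1]) for _ in range(m)] for _ in range(T)]
alpha = [random.uniform(0.1, 1.0) for _ in range(T)]

# iterative update (2): D_{t+1}(i) = D_t(i) e^{-alpha_t y_i h_t(x_i)} / Z_t
D = [1.0 / m] * m
Zs = []
for t in range(T):
    D = [D[i] * math.exp(-alpha[t] * y[i] * h[t][i]) for i in range(m)]
    Zs.append(sum(D))          # Z_t is the normalizer
    D = [d / Zs[-1] for d in D]

# closed form from part b: D_{T+1}(i) = e^{-y_i f(x_i)} / (m * prod_t Z_t)
f = [sum(alpha[t] * h[t][i] for t in range(T)) for i in range(m)]
closed = [math.exp(-y[i] * f[i]) / (m * math.prod(Zs)) for i in range(m)]
assert all(abs(a - b) < 1e-12 for a, b in zip(D, closed))
print("closed form matches the iterative update")
```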

slide-85
SLIDE 85

c. We will make use of the fact that the exponential loss function upper bounds the 0-1 loss function, i.e., 1{x<0} ≤ e^{−x} for all x ∈ R:

err_S(H_T) = (1/m) Σ_{i=1}^{m} 1{y_i f(x_i) < 0} ≤ (1/m) Σ_{i=1}^{m} e^{−y_i f(x_i)}
           = [by part b] (1/m) Σ_{i=1}^{m} D_{T+1}(i) · m · Π_{t=1}^{T} Z_t
           = (Σ_{i=1}^{m} D_{T+1}(i)) · Π_{t=1}^{T} Z_t = Π_{t=1}^{T} Z_t,

since Σ_{i=1}^{m} D_{T+1}(i) = 1. 84.

slide-86
SLIDE 86
  • d. We will start from the equation Z_t = ε_t · e^{α_t} + (1 − ε_t) · e^{−α_t}, which has been proven at part a. Note that ε_t (the error produced by h_t, the hypothesis delivered by the weak classifier A at the current step) does not depend on α_t. Then we proceed as usual, setting the partial derivative w.r.t. α_t to zero:

∂/∂α_t [ε_t · e^{α_t} + (1 − ε_t) · e^{−α_t}] = 0 ⇔ ε_t · e^{α_t} − (1 − ε_t) · e^{−α_t} = 0 ⇔ ε_t · (e^{α_t})² = 1 − ε_t ⇔ e^{2α_t} = (1 − ε_t)/ε_t ⇔ α_t = (1/2) ln((1 − ε_t)/ε_t).

Note that (1 − ε_t)/ε_t > 1 (and therefore α_t > 0) because ε_t ∈ (0, 1/2). It can also be immediately shown that α_t = (1/2) ln((1 − ε_t)/ε_t) is indeed the value at which the expression ε_t · e^{α_t} + (1 − ε_t) · e^{−α_t}, and therefore Z_t too, reaches its minimum, since the derivative is positive exactly to the right of this point:

ε_t · e^{α_t} − (1 − ε_t) · e^{−α_t} > 0 ⇔ e^{2α_t} > (1 − ε_t)/ε_t ⇔ α_t > (1/2) ln((1 − ε_t)/ε_t). 85.
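The closed-form minimizer can be sanity-checked against a crude grid search over α. A small sketch (names are my own) using ε_t = 2/9, the value from the worked example later in this deck:

```python
import math

def Z(alpha, eps):
    """Z_t as a function of alpha_t, for a fixed weighted error eps_t (eq. (3))."""
    return eps * math.exp(alpha) + (1 - eps) * math.exp(-alpha)

eps = 2 / 9
closed = 0.5 * math.log((1 - eps) / eps)      # the AdaBoost choice of alpha_t
# brute-force grid search over alpha in [-3, 3], step 1e-4
grid = [i / 10000 for i in range(-30000, 30001)]
best = min(grid, key=lambda a: Z(a, eps))
print(round(best, 3), round(closed, 3))       # both approx. 0.626
```

At the minimizer, Z equals the value 2·√(ε_t(1 − ε_t)) from relation (6).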

slide-87
SLIDE 87

[Plot: three Z(β) functions, Z(β) = ε_t · β + (1 − ε_t) · 1/β, where β := e^{α} (α being free(!) here) and ε_t is fixed, drawn for ε_t = 1/10, 1/4 and 2/5. In each case the minimum is attained at

β_min = √((1 − ε_t)/ε_t), with Z(β_min) = 2 · √(ε_t(1 − ε_t)) and α_min = ln β_min = ln √((1 − ε_t)/ε_t).]

86.

slide-88
SLIDE 88
  • e. Making use of relationship (6), proven at part a, and of the fact that 1 − x ≤ e^{−x} for all x ∈ R, we can write:

Π_{t=1}^{T} Z_t = Π_{t=1}^{T} 2 · √(ε_t(1 − ε_t)) = Π_{t=1}^{T} 2 · √((1/2 − γ_t)(1/2 + γ_t)) = Π_{t=1}^{T} √(1 − 4γ_t²)
               ≤ Π_{t=1}^{T} √(e^{−4γ_t²}) = Π_{t=1}^{T} e^{−2γ_t²} = e^{−2 Σ_{t=1}^{T} γ_t²}.

87.

slide-89
SLIDE 89

f. From the results obtained at parts c and e, and using γ ≤ γ_t, we get:

err_S(H_T) ≤ e^{−2 Σ_{t=1}^{T} γ_t²} ≤ e^{−2Tγ²} = 1/e^{2Tγ²}.

Therefore, err_S(H_T) < ε if

−2Tγ² < ln ε ⇔ 2Tγ² > −ln ε ⇔ 2Tγ² > ln(1/ε) ⇔ T > (1/(2γ²)) ln(1/ε).

Hence we need T = O((1/γ²) ln(1/ε)) rounds.

Note: It follows that err_S(H_T) → 0 as T → ∞. 88.
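Both the bound from part e and the round count from part f are easy to evaluate numerically. A small sketch (function name and the sample γ_t values are my own):

```python
import math

def rounds_needed(gamma, eps):
    """Smallest integer T with e^{-2 T gamma^2} < eps,
    i.e. T > ln(1/eps) / (2 gamma^2)."""
    return math.floor(math.log(1 / eps) / (2 * gamma ** 2)) + 1

# check prod_t Z_t <= e^{-2 sum_t gamma_t^2} on some sample margins gamma_t,
# using Z_t = 2 sqrt(eps_t (1 - eps_t)) with eps_t = 1/2 - gamma_t
gammas = [0.1, 0.2, 0.05, 0.3]
prodZ = math.prod(2 * math.sqrt((0.5 - g) * (0.5 + g)) for g in gammas)
bound = math.exp(-2 * sum(g * g for g in gammas))
assert prodZ <= bound
print(rounds_needed(0.1, 0.01))   # 231
```

So a guaranteed per-round edge of γ = 0.1 drives the training error below 1% within a few hundred rounds.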

slide-90
SLIDE 90

Exemplifying the application of AdaBoost algorithm

CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.6

89.

slide-91
SLIDE 91

Consider the training dataset in the nearby figure. Run T = 3 iterations of AdaBoost with decision stumps (axis-aligned separators) as the base learners. Illustrate the learned weak hypotheses h_t in this figure and fill in the table given below.

(For the pseudo-code of the AdaBoost algorithm, see CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5. Please read the Important Remark that follows that pseudo-code!)

[Figure: nine labeled training points x1, . . . , x9 in the (X1, X2) plane.]

[Table to fill in, with columns t, ε_t, α_t, D_t(1), . . . , D_t(9), err_S(H), and rows t = 1, 2, 3.]

Note:

The goal of this exercise is to help you understand how AdaBoost works in practice. It is advisable that — after understanding this exercise — you would implement a program / function that calculates the weighted training error produced by a given decision stump, w.r.t. a certain probabilistic distribution (D) defined on the training dataset. Later on you will extend this program to a full-fledged implementation of AdaBoost.

90.

slide-92
SLIDE 92

Solution

Unlike the graphical representation that we used until now for decision stumps (as trees of depth 1), here we will work with the following analytical representation: for a continuous attribute X taking values x ∈ R and for any threshold s ∈ R, we can define two decision stumps:

sign(x − s) = +1 if x ≥ s, and −1 if x < s;
sign(s − x) = −1 if x ≥ s, and +1 if x < s.

For convenience, in the sequel we will denote the first decision stump by X ≥ s and the second by X < s.

According to the Important Remark that follows the AdaBoost pseudo-code [see CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5], at each iteration (t) the weak algorithm A selects the/a decision stump which, among all decision stumps, has the minimum weighted training error w.r.t. the current distribution (D_t) on the training data. 91.
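Following the suggestion in the Note two slides back, the two stump families and their weighted training error can be sketched in a few lines (the helper names and the three toy points below are my own, not the exercise's data):

```python
def stump(feature, s, ge_is_positive=True):
    """The two stumps from the analytical representation above, for instances
    x = (x[0], x[1], ...): 'X_feature >= s' is sign(x[feature] - s),
    'X_feature < s' is sign(s - x[feature])."""
    def h(x):
        pred = 1 if x[feature] >= s else -1
        return pred if ge_is_positive else -pred
    return h

def weighted_error(h, S, D):
    """err_D(h): the total D-weight of the examples (x_i, y_i) misclassified by h."""
    return sum(d for (x, y), d in zip(S, D) if h(x) != y)

# toy check on three made-up points (illustrative only)
S = [((1, 1), 1), ((2, 3), -1), ((4, 2), 1)]
D = [1/3, 1/3, 1/3]
e = weighted_error(stump(0, 3), S, D)               # stump 'X1 >= 3'
# complementary stumps have complementary errors, as used on the next slides
assert abs(e + weighted_error(stump(0, 3, False), S, D) - 1) < 1e-12
print(e)   # 0.3333...: only the point (1, 1) is misclassified
```

The assertion is exactly the identity err_D(X ≥ s) = 1 − err_D(X < s) invoked in the iteration tables that follow.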

slide-93
SLIDE 93

Notes

When applying the ID3 algorithm, for each continuous attribute X we used a threshold for each pair of examples (x_i, y_i), (x_{i+1}, y_{i+1}) with y_i · y_{i+1} < 0, such that x_i < x_{i+1} but no x_j ∈ Val(X) satisfies x_i < x_j < x_{i+1}. We will proceed similarly when applying AdaBoost with decision stumps and continuous attributes.

[In the case of the ID3 algorithm, there is a theoretical result stating that there is no need to consider other thresholds for a continuous attribute X apart from those situated between pairs of successive values (x_i < x_{i+1}) having opposite labels (y_i ≠ y_{i+1}), because the Information Gain (IG) for the other thresholds (x_i < x_{i+1}, with y_i = y_{i+1}) is provably less than the maximal IG for X. LC: A similar result can be proven, which allows us to simplify the application of the weak classifier (A) in the framework of the AdaBoost algorithm.]

Moreover, we will also consider a threshold from the outside of the interval of values taken by the attribute X in the training dataset. [The decision stumps corresponding to this "outside" threshold can be associated with the decision trees of depth 0 that we met in other problems.] 92.

slide-94
SLIDE 94

Iteration t = 1: At this stage (i.e., the first iteration of AdaBoost) the thresholds for the two continuous variables (X1 and X2), corresponding to the two coordinates of the training instances (x1, . . . , x9), are:

  • 1/2, 5/2 and 9/2 for X1, and
  • 1/2, 3/2, 5/2 and 7/2 for X2.

One can easily see that we can get rid of the "outside" threshold 1/2 for X2, because the decision stumps corresponding to this threshold act in the same way as the decision stumps associated with the "outside" threshold 1/2 for X1.

The decision stumps corresponding to this iteration, together with their associated weighted training errors, are shown on the next slide. When filling those tables, we have used the equalities err_{D_t}(X1 ≥ s) = 1 − err_{D_t}(X1 < s) and, similarly, err_{D_t}(X2 ≥ s) = 1 − err_{D_t}(X2 < s), for any threshold s and every iteration t = 1, 2, . . . These equalities are easy to prove. 93.

slide-95
SLIDE 95

s                 1/2    5/2    9/2
err_D1(X1 < s)    4/9    2/9    4/9 + 2/9 = 2/3
err_D1(X1 ≥ s)    5/9    7/9    1/3

s                 1/2    3/2               5/2               7/2
err_D1(X2 < s)    4/9    1/9 + 3/9 = 4/9   2/9 + 1/9 = 1/3   2/9
err_D1(X2 ≥ s)    5/9    5/9               2/3               7/9

It can be seen that the minimal weighted training error (ε_1 = 2/9) is obtained for the decision stumps X1 < 5/2 and X2 < 7/2. Therefore we can choose h_1 = sign(7/2 − X2) as the best hypothesis at iteration t = 1; the corresponding separator is the line X2 = 7/2. The h_1 hypothesis wrongly classifies the instances x4 and x5. Then

γ_1 = 1/2 − 2/9 = 5/18 and α_1 = (1/2) ln((1 − ε_1)/ε_1) = ln √((1 − 2/9) : (2/9)) = ln √(7/2) ≈ 0.626. 94.

slide-96
SLIDE 96

Now the algorithm must get a new distribution (D_2) by altering the old one (D_1) so that the next iteration concentrates more on the misclassified instances. Since e^{−α_1} = √(2/7):

D_2(i) = (1/Z_1) · D_1(i) · (e^{−α_1})^{y_i h_1(x_i)} =
  (1/Z_1) · (1/9) · √(2/7) for i ∈ {1, 2, 3, 6, 7, 8, 9};
  (1/Z_1) · (1/9) · √(7/2) for i ∈ {4, 5}.

Remember that Z_1 is a normalization factor for D_2. So,

Z_1 = (1/9) · (7 · √(2/7) + 2 · √(7/2)) = 2√14/9 ≈ 0.8315.

Therefore,

D_2(i) =
  (9/(2√14)) · (1/9) · √(2/7) = 1/14 for i ∈ {1, 2, 3, 6, 7, 8, 9};
  (9/(2√14)) · (1/9) · √(7/2) = 1/4 for i ∈ {4, 5}.

[Figure: the training points with the separator h_1 (the line X2 = 7/2); x4 and x5 now carry weight 1/4 each, the other seven points 1/14 each.]
95.
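Instead of normalizing explicitly, the same weights can be obtained from relationship a/iv of the previous exercise, D_{t+1}(i) = D_t(i)/(2ε_t) on the mistake set and D_t(i)/(2(1 − ε_t)) on the correct set. A small sketch with exact rational arithmetic (the helper name is my own):

```python
from fractions import Fraction as F

def reweight(D, mistakes, eps):
    """Part a/iv of the previous exercise: D_{t+1}(i) = D_t(i)/(2 eps) on the
    mistake set, D_t(i)/(2 (1 - eps)) on the correct set (indices start at 1)."""
    return [d / (2 * eps) if i in mistakes else d / (2 * (1 - eps))
            for i, d in enumerate(D, start=1)]

D1 = [F(1, 9)] * 9                  # uniform initial distribution
D2 = reweight(D1, {4, 5}, F(2, 9))  # h1 = sign(7/2 - X2) errs on x4, x5
D3 = reweight(D2, {8, 9}, F(1, 7))  # h2 = sign(5/2 - X1) errs on x8, x9
print(D2[0], D2[3], D3[0], D3[3], D3[7])   # 1/14 1/4 1/24 7/48 1/4
assert sum(D2) == 1 and sum(D3) == 1
```

The printed values reproduce the D_2 weights above and the D_3 weights computed a few slides below.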

slide-97
SLIDE 97

Note

If, instead of sign(7/2 − X2), we would have taken as hypothesis h_1 the decision stump sign(5/2 − X1), the subsequent calculations would have been slightly different (although both decision stumps have the same, minimal, weighted training error, 2/9): x8 and x9 would have been allocated the weights 1/4, while x4 and x5 would have been allocated the weights 1/14. (Therefore, the output of AdaBoost may not be uniquely determined!) 96.

slide-98
SLIDE 98

Iteration t = 2:

s                 1/2     5/2     9/2
err_D2(X1 < s)    4/14    2/14    2/14 + 2/4 + 2/14 = 11/14
err_D2(X1 ≥ s)    10/14   12/14   3/14

s                 1/2     3/2                  5/2                 7/2
err_D2(X2 < s)    4/14    1/4 + 3/14 = 13/28   2/4 + 1/14 = 8/14   2/4 = 1/2
err_D2(X2 ≥ s)    10/14   15/28                6/14                1/2

Note: According to the theoretical result presented at part a of CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5, computing the weighted error rate of the decision stump [corresponding to the test] X2 < 7/2 is now superfluous, because this decision stump was chosen as the optimal hypothesis at the previous iteration, so its weighted error w.r.t. D_2 is 1/2. (Nevertheless, we have placed it into the table, for the sake of a thorough presentation.) 97.

slide-99
SLIDE 99

Now the best hypothesis is h_2 = sign(5/2 − X1); the corresponding separator is the line X1 = 5/2.

ε_2 = P_{D2}({x8, x9}) = 2/14 = 1/7 ≈ 0.143 ⇒ γ_2 = 1/2 − 1/7 = 5/14
α_2 = ln √((1 − ε_2)/ε_2) = ln √((1 − 1/7) : (1/7)) = ln √6 ≈ 0.896

D_3(i) = (1/Z_2) · D_2(i) · (e^{−α_2})^{y_i h_2(x_i)}, with e^{−α_2} = 1/√6:
  (1/Z_2) · D_2(i) · (1/√6) if h_2(x_i) = y_i, and (1/Z_2) · D_2(i) · √6 otherwise;

that is,
  (1/Z_2) · (1/14) · (1/√6) for i ∈ {1, 2, 3, 6, 7};
  (1/Z_2) · (1/4) · (1/√6) for i ∈ {4, 5};
  (1/Z_2) · (1/14) · √6 for i ∈ {8, 9}. 98.

slide-100
SLIDE 100

Z_2 = 5 · (1/14) · (1/√6) + 2 · (1/4) · (1/√6) + 2 · (1/14) · √6 = 5/(14√6) + 1/(2√6) + √6/7 = 24/(14√6) = 2√6/7 ≈ 0.7

D_3(i) =
  (7/(2√6)) · (1/14) · (1/√6) = 1/24 for i ∈ {1, 2, 3, 6, 7};
  (7/(2√6)) · (1/4) · (1/√6) = 7/48 for i ∈ {4, 5};
  (7/(2√6)) · (1/14) · √6 = 1/4 for i ∈ {8, 9}.

[Figure: the training points with the separators h_1 (X2 = 7/2) and h_2 (X1 = 5/2), and the new weights: 1/24 for x1, x2, x3, x6, x7; 7/48 for x4, x5; 1/4 for x8, x9.]

99.

slide-101
SLIDE 101

Iteration t = 3:

s                 1/2                  5/2    9/2
err_D3(X1 < s)    2/24 + 2/4 = 7/12    2/4    2/24 + 2·7/48 + 2·1/4 = 21/24
err_D3(X1 ≥ s)    5/12                 2/4    3/24 = 1/8

s                 1/2     3/2                        5/2                   7/2
err_D3(X2 < s)    7/12    7/48 + 2/24 + 1/4 = 23/48   2·7/48 + 1/24 = 1/3   2·7/48 = 7/24
err_D3(X2 ≥ s)    5/12    25/48                      2/3                   17/24

100.

slide-102
SLIDE 102

The new best hypothesis is h_3 = sign(X1 − 9/2); the corresponding separator is the line X1 = 9/2.

ε_3 = P_{D3}({x1, x2, x7}) = 3 · 1/24 = 3/24 = 1/8
γ_3 = 1/2 − 1/8 = 3/8
α_3 = ln √((1 − ε_3)/ε_3) = ln √((1 − 1/8) : (1/8)) = ln √7 ≈ 0.973

[Figure: the training points with the three separators h_1 (X2 = 7/2), h_2 (X1 = 5/2) and h_3 (X1 = 9/2), with their positive and negative sides.]
101.

slide-103
SLIDE 103

Finally, after filling our results in the given table, we get:

t   ε_t    α_t        D_t(1)  D_t(2)  D_t(3)  D_t(4)  D_t(5)  D_t(6)  D_t(7)  D_t(8)  D_t(9)  err_S(H)
1   2/9    ln √(7/2)  1/9     1/9     1/9     1/9     1/9     1/9     1/9     1/9     1/9     2/9
2   2/14   ln √6      1/14    1/14    1/14    1/4     1/4     1/14    1/14    1/14    1/14    2/9
3   1/8    ln √7      1/24    1/24    1/24    7/48    7/48    1/24    1/24    1/4     1/4     0

Note: The following table helps you understand how err_S(H) was computed; remember that H(x_i) := sign(Σ_{t=1}^{T} α_t h_t(x_i)).

t   α_t     x1   x2   x3   x4   x5   x6   x7   x8   x9
1   0.626   +1   +1   −1   +1   +1   −1   −1   +1   +1
2   0.896   +1   +1   −1   −1   −1   −1   −1   −1   −1
3   0.973   −1   −1   −1   −1   −1   −1   +1   +1   +1
H(x_i)      +1   +1   −1   −1   −1   −1   −1   +1   +1 102.

slide-104
SLIDE 104

Remark: One can immediately see that the [test] instance (1, 4) will be classified by the hypothesis H learned by AdaBoost as negative (since −α_1 + α_2 − α_3 = −0.626 + 0.896 − 0.973 < 0). After making other similar calculations, we can conclude that the decision zones and the decision boundaries produced by AdaBoost for the given training data will be as indicated in the nearby figure.

[Figure: the decision zones delimited by the separators h_1, h_2 and h_3.]

Remark: The execution of AdaBoost could continue (if we had initially taken T > 3), although we have obtained err_S(H) = 0 at iteration t = 3. By elaborating the details, we would see that for t = 4 we would obtain as optimal hypothesis X2 < 7/2 (which had already been selected at iteration t = 1). This hypothesis now produces the weighted training error ε_4 = 1/6. Therefore α_4 = ln √5, and this will be added to α_1 = ln √(7/2) in the new output H. In this way, the confidence in the hypothesis X2 < 7/2 would be strengthened. So, we should keep in mind that AdaBoost can select a certain weak hypothesis several times (but never at consecutive iterations, cf. CMU, 2015 fall, E. Xing, Z. Bar-Joseph, HW4, pr. 2.1). 103.

slide-105
SLIDE 105

Graphs made by MSc student Sebastian Ciobanu (2018 fall)

[Plot: the variation of ε_t w.r.t. t, over 40 iterations.]

[Plot: the two upper bounds of the empirical error of H_T, namely Π_{t=1}^{T} Z_t and exp(−2 Σ_{t=1}^{T} γ_t²), together with err_S(H_T), as functions of T.]
104.

slide-106
SLIDE 106

AdaBoost and [non] empirical γ-weak learnability: Exemplification on a dataset from R

CMU, 2012 fall, Tom Mitchell, Ziv Bar-Joseph, final, pr. 8.a-e

105.

slide-107
SLIDE 107

In this problem, we study how AdaBoost performs on a very simple classification problem shown in the nearby figure.

[Figure: three labeled training points x1, x2, x3 on the real line.]

We use a decision stump for each weak hypothesis h_i. A decision stump classifier chooses a constant value s and classifies all points where x > s as one class and the other points, where x ≤ s, as the other class.

  • a. What is the initial weight that is assigned to each data point?
  • b. Show the decision boundary for the first decision stump (indicate the positive and negative side of the decision boundary).
  • c. Circle the point whose weight increases in the boosting process.
  • d. Write down the weight that is assigned to each data point after the first iteration of the boosting algorithm.
  • e. Can the boosting algorithm perfectly classify all the training examples? If no, briefly explain why. If yes, what is the minimum number of iterations? 106.

slide-108
SLIDE 108

Answer

[Figures: for each iteration t, the chosen decision stump h_t on the real line, the current point weights, and the weighted errors of the candidate stumps.]

With the outside threshold, the iterations t = 1, 2, 3 select stumps with weighted errors ε_1 = 1/3, ε_2 = 1/4 and ε_3 = 1/6. Without the outside threshold, the weighted errors are ε_1 = 1/3, ε_2 = 1/4, ε_3 = 1/3 and ε_4 = 3/8. The corresponding α_t values and the resulting combined classifiers are detailed on the next slide. 107.

107.

slide-109
SLIDE 109

With the outside threshold:

  • t = 1: ε_1 = 1/3 ⇒ α_1 = ln √2 ≈ 0.3465, err_S(H_1) = 1/3
  • t = 2: ε_2 = 1/4 ⇒ α_2 = ln √3 ≈ 0.5493

         x1   x2   x3
  α_1    −    −    −
  α_2    −    +    +
  H_2    −    +    +     ⇒ err_S(H_2) = 1/3

  • t = 3: ε_3 = 1/6 ⇒ α_3 = ln √5 ≈ 0.8047

         x1   x2   x3
  α_1    −    −    −
  α_2    −    +    +
  α_3    +    +    −
  H_3    −    +    −     ⇒ err_S(H_3) = 0

Without the outside threshold:

  • t = 1: ε_1 = 1/3 ⇒ α_1 = ln √2 ≈ 0.3465, err_S(H_1) = 1/3
  • t = 2: ε_2 = 1/4 ⇒ α_2 = ln √3 ≈ 0.5493

         x1   x2   x3
  α_1    −    +    +
  α_2    +    +    −
  H_2    +    +    −     ⇒ err_S(H_2) = 1/3

  • t = 3: ε_3 = 1/3 = ε_1 ⇒ α_3 = ln √2 ≈ 0.3465 = α_1

         x1   x2   x3
  α_1    −    +    +
  α_2    +    +    −
  α_3    −    +    +
  H_3    −    +    +     ⇒ err_S(H_3) = 1/3

  • t = 4: ε_4 = 3/8 ⇒ α_4 = ln √(5/3) ≈ 0.2554

         x1   x2   x3
  α_1    −    +    +
  α_2    +    +    −
  α_3    −    +    +
  α_4    +    +    −
  H_4    +    +    −     ⇒ err_S(H_4) = 1/3

It can be easily proven that the signs [of the weighted votes] for x1 and x3 will always be opposite to each other, while the sign for x2 will always be +. Therefore err_S(H_T) = 1/3 for any T ∈ N*.

108.
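As a quick sanity check of the "with the outside threshold" case above, a short sketch that recomputes the combined predictions H_3 from the α_t values and the individual votes (the variable names are my own):

```python
import math

# weighted vote on the three points, with the outside threshold
alphas = [math.log(math.sqrt(2)),   # alpha_1 = ln sqrt(2)
          math.log(math.sqrt(3)),   # alpha_2 = ln sqrt(3)
          math.log(math.sqrt(5))]   # alpha_3 = ln sqrt(5)
votes = [[-1, -1, -1],              # h_1(x1), h_1(x2), h_1(x3)
         [-1, +1, +1],              # h_2
         [+1, +1, -1]]              # h_3
H = [1 if sum(a * v[i] for a, v in zip(alphas, votes)) > 0 else -1
     for i in range(3)]
print(H)   # [-1, 1, -1]: all three training points correctly classified
```

Note how close the vote for x1 is to zero (−ln √2 − ln √3 + ln √5 ≈ −0.09): the outside threshold barely tips the balance.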

slide-110
SLIDE 110

Graphs made by Sebastian Ciobanu

[Plots: for each of the two settings (with and without the outside threshold), the variation of ε_t w.r.t. the iteration t, and the two upper bounds of the empirical error of H_T, namely Π_{t=1}^{T} Z_t and exp(−2 Σ_{t=1}^{T} γ_t²), together with err_S(H_T), as functions of T.]

109.

slide-111
SLIDE 111

Seeing AdaBoost as an optimization algorithm, w.r.t. the [negative] exponential loss function

CMU, 2008 fall, Eric Xing, HW3, pr. 4.1.1 CMU, 2008 fall, Eric Xing, midterm, pr. 5.1

110.

slide-112
SLIDE 112

At CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5, part d, we have shown that in AdaBoost we try to [indirectly] minimize the training error err_S(H) by sequentially minimizing its upper bound Π_{t=1}^{T} Z_t, i.e., at each iteration t (1 ≤ t ≤ T) we choose α_t so as to minimize Z_t (viewed as a function of α_t).

Here you will see that another way to explain AdaBoost is by sequentially minimizing the [negative] exponential loss:

J_T := (1/m) Σ_{i=1}^{m} exp(−y_i f_T(x_i)) = (1/m) Σ_{i=1}^{m} exp(−y_i Σ_{t=1}^{T} α_t h_t(x_i)).   (8)

That is to say, at the t-th iteration (1 ≤ t ≤ T) we want to choose, besides the appropriate classifier h_t, the corresponding weight α_t so that the overall loss J_t (accumulated up to the t-th iteration) is minimized. Prove that this [new] strategy will lead to the same update rule for α_t used in AdaBoost, i.e., α_t = (1/2) ln((1 − ε_t)/ε_t).

Hint: You can use the fact that D_t(i) ∝ exp(−y_i f_{t−1}(x_i)), and it [LC: the proportionality factor] can be viewed as constant when we try to optimize J_t with respect to α_t in the t-th iteration. 111.

slide-113
SLIDE 113

Solution

At the t-th iteration, we have

J_t = (1/m) Σ_{i=1}^{m} exp(−y_i f_t(x_i))
    = (1/m) Σ_{i=1}^{m} exp(−y_i Σ_{t′=1}^{t−1} α_{t′} h_{t′}(x_i) − y_i α_t h_t(x_i))
    = (1/m) Σ_{i=1}^{m} exp(−y_i f_{t−1}(x_i)) · exp(−y_i α_t h_t(x_i))
    = (1/m) Σ_{i=1}^{m} (m · Π_{t′=1}^{t−1} Z_{t′}) · D_t(i) · exp(−y_i α_t h_t(x_i))
      [see CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.1-5, part b]
    = (Π_{t′=1}^{t−1} Z_{t′}) · Σ_{i=1}^{m} D_t(i) · exp(−y_i α_t h_t(x_i)),

where we denote J′_t := Σ_{i=1}^{m} D_t(i) · exp(−y_i α_t h_t(x_i)).

Further on, we can rewrite J′_t as

J′_t = Σ_{i∈C} D_t(i) exp(−α_t) + Σ_{i∈M} D_t(i) exp(α_t) = (1 − ε_t) · e^{−α_t} + ε_t · e^{α_t},   (9)

where C is the set of examples which are correctly classified by h_t, and M is the set of examples which are misclassified by h_t. 112.

slide-114
SLIDE 114

The relation (9) is identical to the expression (3) from part a of CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5 (see the solution). Therefore, when ε_t is fixed, J_t will reach its minimum for α_t = (1/2) ln((1 − ε_t)/ε_t). 113.

slide-115
SLIDE 115

[Plot: graph made for the J′_t functions by Sebastian Ciobanu for CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.6 (notation: β = e^α): the curves Z(β) for ε_1 = 2/9, ε_2 = 1/7 and ε_3 = 1/8.]

114.

slide-116
SLIDE 116

Important Remark

If we rewrite expression (9) in the form e^{−α_t} + ε_t · (e^{α_t} − e^{−α_t}), with e^{α_t} − e^{−α_t} > 0, one can see that if we fix α_t, then minimizing the expression J_t with respect to [the choice of] h_t comes down to minimizing ε_t (which does not depend on α_t). Consequently, the AdaBoost algorithm can be reformulated as an algorithm for the sequential optimization of the costs / "losses" J_t:

Input: {(x_i, y_i)}_{i=1,...,m} (the training dataset), T (the number of iterations to be executed), H (the set of weak hypotheses), and φ(y, y′) = exp(−y y′) (the exponential loss function).

Procedure:
Initialize the classifier f_0(x) = 0 and calculate the distribution D_1(i) = 1/m for i = 1, . . . , m.
for t = 1 to T do:
  • 1. Compute (h_t, α_t) = argmin_{h∈H, α∈R} Σ_{i=1}^{m} φ(y_i, f_{t−1}(x_i) + α h(x_i))  [=: J_t(h, α)]
  • 2. Update the classifier f_t(x) = f_{t−1}(x) + α_t h_t(x) and calculate the new distribution, D_{t+1}
end for
return the classifier sign(f_T(x))

The intuition is that, at each step, the algorithm greedily adds a hypothesis h ∈ H to the current combination to minimize the φ-risk.

115.
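One greedy step of this reformulated procedure can be sketched directly: for a fixed h, the inner minimization over α has the closed form proven above, so the argmin over (h, α) reduces to a scan over H. The names and the tiny 1-D dataset below are my own, purely for illustration:

```python
import math

def phi_risk_step(S, f, H):
    """One greedy step: (h_t, alpha_t) = argmin_{h, alpha}
    sum_i exp(-y_i (f(x_i) + alpha h(x_i))). For fixed h the inner minimum is
    at alpha = (1/2) ln((1 - eps)/eps), where eps is the weighted error of h
    under D(i) proportional to exp(-y_i f(x_i))."""
    w = [math.exp(-y * f(x)) for x, y in S]
    D = [wi / sum(w) for wi in w]
    best = None
    for h in H:
        eps = sum(d for d, (x, y) in zip(D, S) if h(x) != y)
        if eps in (0, 1):
            continue                      # alpha would be unbounded
        alpha = 0.5 * math.log((1 - eps) / eps)
        J = sum(math.exp(-y * (f(x) + alpha * h(x))) for x, y in S)
        if best is None or J < best[0]:
            best = (J, h, alpha)
    return best[1], best[2]

# made-up 1-D data and a tiny class of threshold stumps
S = [(1, -1), (2, 1), (3, 1), (4, -1)]
H = [lambda x, s=s: 1 if x > s else -1 for s in (0, 1, 2, 3)]
h1, a1 = phi_risk_step(S, lambda x: 0.0, H)
print([h1(x) for x, _ in S], round(a1, 3))   # [-1, 1, 1, 1] 0.549
```

The returned α_1 = (1/2) ln 3 is exactly the AdaBoost weight for a stump with weighted error 1/4, as the equivalence proven above predicts.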

slide-117
SLIDE 117

AdaBoost algorithm: the notion of [voting] margin; some properties

CMU, 2016 spring, W. Cohen, N. Balcan, HW4, pr. 3.3

116.

slide-118
SLIDE 118

Although model complexity increases with each iteration, AdaBoost usually does not overfit. The reason behind this is that the model becomes more "confident" as we increase the number of iterations. The "confidence" can be expressed mathematically as the [voting] margin.

Recall that after the algorithm AdaBoost terminates with T iterations, the [output] classifier is H_T(x) = sign(Σ_{t=1}^{T} α_t h_t(x)). Similarly, we can define the intermediate weighted classifier after k iterations as H_k(x) = sign(Σ_{t=1}^{k} α_t h_t(x)). As its output is either −1 or 1, it does not tell the confidence of its judgement. Here, without changing the decision rule, let

H_k(x) = sign(Σ_{t=1}^{k} ᾱ_t h_t(x)), where ᾱ_t = α_t / Σ_{t′=1}^{k} α_{t′},

so that the weights on the weak classifiers are normalized. 117.

slide-119
SLIDE 119

Define the margin after the k-th iteration as [the sum of] the [normalized] weights of the h_t voting correctly minus [the sum of] the [normalized] weights of the h_t voting incorrectly:

Margin_k(x) = Σ_{t: h_t(x) = y} ᾱ_t − Σ_{t: h_t(x) ≠ y} ᾱ_t.

  • a. Let f_k(x) := Σ_{t=1}^{k} ᾱ_t h_t(x). Show that Margin_k(x_i) = y_i f_k(x_i) for all training instances x_i, with i = 1, . . . , m.

  • b. If Margin_k(x_i) > Margin_k(x_j), which of the samples x_i and x_j will receive a higher weight in iteration k + 1?
Hint: Use the relation D_{k+1}(i) = (1/(m · Π_{t=1}^{k} Z_t)) · exp(−y_i f_k(x_i)), which was proven at CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.2. 118.

slide-120
SLIDE 120

Solution

  • a. We will prove the equality starting from its right-hand side:

y_i f_k(x_i) = y_i Σ_{t=1}^{k} ᾱ_t h_t(x_i) = Σ_{t=1}^{k} ᾱ_t y_i h_t(x_i) = Σ_{t: h_t(x_i) = y_i} ᾱ_t − Σ_{t: h_t(x_i) ≠ y_i} ᾱ_t = Margin_k(x_i).

  • b. According to the relationship already proven at part a,

Margin_k(x_i) > Margin_k(x_j) ⇔ y_i f_k(x_i) > y_j f_k(x_j) ⇔ −y_i f_k(x_i) < −y_j f_k(x_j) ⇔ exp(−y_i f_k(x_i)) < exp(−y_j f_k(x_j)).

Based on the given Hint, it follows that D_{k+1}(i) < D_{k+1}(j), so x_j (the sample with the smaller margin) receives the higher weight. 119.

slide-121
SLIDE 121

Important Remark

It can be shown that boosting tends to increase the margins of training examples (see relation (8) at CMU, 2008 fall, Eric Xing, HW3, pr. 4.1.1), and that a large margin on training examples reduces the generalization error. Thus we can explain why, although the number of "parameters" of the model created by AdaBoost increases by 2 at every iteration (and therefore the complexity rises), it usually does not overfit.

120.

slide-122
SLIDE 122

AdaBoost: a sufficient condition for γ-weak learnability, based on the voting margins

CMU, 2016 spring, W. Cohen, N. Balcan, HW4, pr. 3.1.4

121.

slide-123
SLIDE 123

At CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.5 we encountered the notion of empirical γ-weak learnability. When this condition — γ ≤ γ_t for all t, where γ_t := 1/2 − ε_t, with ε_t being the weighted training error produced by the weak hypothesis h_t — is met, it ensures that AdaBoost will drive down the training error quickly. However, this condition does not hold all the time. In this problem we will prove a sufficient condition for empirical weak learnability [to hold]. This condition refers to the notion of voting margin, which was presented in CMU, 2016 spring, W. Cohen, N. Balcan, HW4, pr. 3.3. Namely, we will prove that if there is a constant θ > 0 such that the [voting] margins of all training instances are lower-bounded by θ at each iteration of the AdaBoost algorithm, then the property of empirical γ-weak learnability is “guaranteed”, with γ = θ/2. 122.

slide-124
SLIDE 124

[Formalisation]

Suppose we are given a training set S = {(x_1, y_1), . . . , (x_m, y_m)} such that, for some weak hypotheses h_1, . . . , h_k from the hypothesis space H and some non-negative coefficients α_1, . . . , α_k with Σ_{j=1}^{k} α_j = 1, there exists θ > 0 such that

y_i ( Σ_{j=1}^{k} α_j h_j(x_i) ) ≥ θ, ∀(x_i, y_i) ∈ S.

Note: according to CMU, 2016 spring, W. Cohen, N. Balcan, HW4, pr. 3.3,

y_i ( Σ_{j=1}^{k} α_j h_j(x_i) ) = Margin_k(x_i) = y_i f_k(x_i), where f_k(x_i) := Σ_{j=1}^{k} α_j h_j(x_i).

Key idea: We will show that if the condition above is satisfied (for a given k), then for any distribution D over S there exists a hypothesis h_l ∈ {h_1, . . . , h_k} with weighted training error at most 1/2 − θ/2 over the distribution D. It will follow that when the condition above is satisfied for any k, the training set S is empirically γ-weak learnable, with γ = θ/2. 123.

slide-125
SLIDE 125
  • a. Show that — if the condition stated above is met — there exists a weak hypothesis h_l from {h_1, . . . , h_k} such that E_{i∼D}[y_i h_l(x_i)] ≥ θ.

Hint: Taking expectation under the same distribution does not change the inequality conditions.

  • b. Show that the inequality E_{i∼D}[y_i h_l(x_i)] ≥ θ is equivalent to

Pr_{i∼D}[y_i ≠ h_l(x_i)] (= err_D(h_l)) ≤ 1/2 − θ/2,

meaning that the weighted training error of h_l is at most 1/2 − θ/2, and therefore γ_t ≥ θ/2. 124.

slide-126
SLIDE 126

Solution

  • a. Since y_i ( Σ_{j=1}^{k} α_j h_j(x_i) ) ≥ θ ⇔ y_i f_k(x_i) ≥ θ for i = 1, . . . , m, it follows (according to the Hint) that

E_{i∼D}[y_i f_k(x_i)] ≥ θ, where f_k(x_i) := Σ_{j=1}^{k} α_j h_j(x_i). (10)

On the other side, by definition, E_{i∼D}[y_i h_l(x_i)] ≥ θ ⇔ Σ_{i=1}^{m} y_i h_l(x_i) · D(i) ≥ θ.

Suppose, on the contrary, that E_{i∼D}[y_i h_l(x_i)] < θ, that is, Σ_{i=1}^{m} y_i h_l(x_i) · D(i) < θ, for l = 1, . . . , k. Then Σ_{i=1}^{m} y_i h_l(x_i) · D(i) · α_l < θ · α_l for l = 1, . . . , k. By summing up these inequations for l = 1, . . . , k we get

Σ_{l=1}^{k} Σ_{i=1}^{m} y_i h_l(x_i) · D(i) · α_l < Σ_{l=1}^{k} θ · α_l ⇔ Σ_{i=1}^{m} y_i D(i) ( Σ_{l=1}^{k} h_l(x_i) α_l ) < θ Σ_{l=1}^{k} α_l ⇔ Σ_{i=1}^{m} y_i f_k(x_i) · D(i) < θ, (11)

because Σ_{j=1}^{k} α_j = 1 and f_k(x_i) := Σ_{l=1}^{k} α_l h_l(x_i).

The inequation (11) can be written as E_{i∼D}[y_i f_k(x_i)] < θ. Obviously, it contradicts the relationship (10). Therefore, the previous supposition is false. In conclusion, there exists l ∈ {1, . . . , k} such that E_{i∼D}[y_i h_l(x_i)] ≥ θ. 125.

slide-127
SLIDE 127

Solution (cont’d)

  • b. We already said that E_{i∼D}[y_i h_l(x_i)] ≥ θ ⇔ Σ_{i=1}^{m} y_i h_l(x_i) · D(i) ≥ θ.

Since y_i ∈ {−1, +1} and h_l(x_i) ∈ {−1, +1} for i = 1, . . . , m and l = 1, . . . , k, we have

Σ_{i=1}^{m} y_i h_l(x_i) · D(i) ≥ θ ⇔ Σ_{i: y_i = h_l(x_i)} D(i) − Σ_{i: y_i ≠ h_l(x_i)} D(i) ≥ θ ⇔ (1 − ε_l) − ε_l ≥ θ ⇔ 1 − 2ε_l ≥ θ ⇔ 2ε_l ≤ 1 − θ ⇔ ε_l ≤ 1/2 − θ/2 ⇔ err_D(h_l) ≤ 1/2 − θ/2, by definition. 126.
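A numeric sanity check of parts a and b, sketched in Python with NumPy on a made-up ensemble; to make the margin condition hold, the first weak hypothesis is forced to agree with the labels and gets a large vote:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 8, 5
y = rng.choice([-1, 1], size=m)
H = rng.choice([-1, 1], size=(k, m))           # H[l, i] = h_l(x_i), toy values
H[0] = y                                       # forces all margins to be positive
alpha = np.array([0.7, 0.1, 0.1, 0.05, 0.05])  # non-negative, sums to 1

margins = y * (alpha @ H)                      # y_i * sum_j alpha_j h_j(x_i)
theta = margins.min()                          # here theta >= 0.7 - 0.3 = 0.4 > 0

for _ in range(100):
    D = rng.dirichlet(np.ones(m))              # an arbitrary distribution over S
    agree = (H * y) @ D                        # E_{i~D}[y_i h_l(x_i)], one entry per l
    l = int(np.argmax(agree))
    assert agree[l] >= theta - 1e-12           # part a: some h_l has E >= theta
    err = D[y != H[l]].sum()                   # weighted error of h_l under D
    assert err <= 0.5 - theta / 2 + 1e-12      # part b: err_D(h_l) <= 1/2 - theta/2
```

The best h_l beats the α-weighted average E_{i∼D}[y_i f_k(x_i)] ≥ θ, which is exactly the argument of the proof.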

slide-128
SLIDE 128

Graphs made by Sebastian Ciobanu

for CMU, 2012 fall, T. Mitchell, Z. Bar-Joseph, final, pr. 8.a-e:
[two plots of the voting Margin (−1.0 to 1.0) versus the iteration number (5 to 30), with and without an outside threshold]

for CMU, 2006 spring, Carlos Guestrin, final, pr. 3 [with outside threshold]:
[one plot of the voting Margin versus the iteration number]

127.

slide-129
SLIDE 129

AdaBoost: Any set of consistently labelled instances from R is empirically γ-weak learnable by using decision stumps

Stanford, 2016 fall, Andrew Ng, John Duchi, HW2, pr. 6.abc

128.

slide-130
SLIDE 130

At CMU, 2015, Z. Bar-Joseph, E. Xing, HW4, pr. 2.5 we encountered the notion of empirical γ-weak learnability. When this condition — γ ≤ γ_t for all t, where γ_t := 1/2 − ε_t, with ε_t being the weighted training error produced by the weak hypothesis h_t — is met, it ensures that AdaBoost will drive down the training error quickly. In this problem we will assume that our input attribute vectors x ∈ R, that is, they are one-dimensional, and we will show that [LC] when these vectors are consistently labelled, decision stumps based on thresholding provide a weak-learning guarantee (γ). 129.

slide-131
SLIDE 131

Decision stumps: analytical definitions / formalization

Thresholding-based decision stumps can be seen as functions indexed by a threshold s and a sign +/−, such that

φ_{s,+}(x) = { 1 if x ≥ s; −1 if x < s }   and   φ_{s,−}(x) = { −1 if x ≥ s; 1 if x < s }.

Therefore, φ_{s,+}(x) = −φ_{s,−}(x).

Key idea for the proof

We will show that given a consistently labelled training set S = {(x_1, y_1), . . . , (x_m, y_m)}, with x_i ∈ R and y_i ∈ {−1, +1} for i = 1, . . . , m, there is some γ > 0 such that for any distribution p defined on this training set there is a threshold s ∈ R for which

error_p(φ_{s,+}) ≤ 1/2 − γ   or   error_p(φ_{s,−}) ≤ 1/2 − γ,

where error_p(φ_{s,+}) and error_p(φ_{s,−}) denote the weighted training errors of φ_{s,+} and respectively φ_{s,−}, computed according to the distribution p. 130.

slide-132
SLIDE 132

Convention: In our problem we will assume that the training instances x_1, . . . , x_m ∈ R are distinct. Moreover, we will assume (without loss of generality, but this makes the proof notationally simpler) that x_1 > x_2 > . . . > x_m.

a. Show that, given S, for each threshold s ∈ R there is some m_0(s) ∈ {0, 1, . . . , m} such that

error_p(φ_{s,+}) := Σ_{i=1}^{m} p_i · 1{y_i ≠ φ_{s,+}(x_i)} = 1/2 − 1/2 ( Σ_{i=1}^{m_0(s)} y_i p_i − Σ_{i=m_0(s)+1}^{m} y_i p_i )   [the parenthesis is denoted f(m_0(s))]

and

error_p(φ_{s,−}) := Σ_{i=1}^{m} p_i · 1{y_i ≠ φ_{s,−}(x_i)} = 1/2 − 1/2 ( Σ_{i=m_0(s)+1}^{m} y_i p_i − Σ_{i=1}^{m_0(s)} y_i p_i )   [the parenthesis is −f(m_0(s))].

Note: Treat sums over empty sets of indices as zero. Therefore, Σ_{i=1}^{0} a_i = 0 for any a_i, and similarly Σ_{i=m+1}^{m} a_i = 0.

131.

slide-133
SLIDE 133
  • b. Prove that, given S, there is some γ > 0 (which may depend on the training set size m) such that for any set of probabilities p on the training set (therefore p_i ≥ 0 and Σ_{i=1}^{m} p_i = 1) we can find m_0 ∈ {0, . . . , m} such that |f(m_0)| ≥ 2γ, where

f(m_0) := Σ_{i=1}^{m_0} y_i p_i − Σ_{i=m_0+1}^{m} y_i p_i.

Note: γ should not depend on p.

Hint: Consider the difference f(m_0) − f(m_0 − 1). What is your γ? 132.

slide-134
SLIDE 134
  • c. Based on your answers to parts a and b, what edge can thresholded decision stumps guarantee on any training set {x_i, y_i}_{i=1}^{m}, where the instances x_i ∈ R are all distinct? Recall that the edge of a weak classifier φ : R → {−1, 1} is the constant γ ∈ (0, 1/2) such that

error_p(φ) := Σ_{i=1}^{m} p_i · 1{φ(x_i) ≠ y_i} ≤ 1/2 − γ.

d. Can you give an upper bound on the number of thresholded decision stumps required to achieve zero error on a given training set? 133.

slide-135
SLIDE 135

Solution

  • a. We perform several algebraic steps.

Let sign(t) = 1 if t ≥ 0, and sign(t) = −1 otherwise. Then 1{φ_{s,+}(x) ≠ y} = 1{sign(x − s) ≠ y} = 1{y · sign(x − s) ≤ 0}, where 1{·} denotes the well-known indicator function. Thus we have

error_p(φ_{s,+}) := Σ_{i=1}^{m} p_i · 1{y_i ≠ φ_{s,+}(x_i)} = Σ_{i=1}^{m} p_i · 1{y_i · sign(x_i − s) ≤ 0} = Σ_{i: x_i ≥ s} p_i · 1{y_i = −1} + Σ_{i: x_i < s} p_i · 1{y_i = 1}.

Thus, if we let m_0(s) be the index in {0, . . . , m} such that x_i ≥ s for i ≤ m_0(s) and x_i < s for i > m_0(s) — which we know must exist because x_1 > x_2 > . . . — we have

error_p(φ_{s,+}) = Σ_{i=1}^{m_0(s)} p_i · 1{y_i = −1} + Σ_{i=m_0(s)+1}^{m} p_i · 1{y_i = 1}. 134.

slide-136
SLIDE 136

Now we make a key observation: we have 1{y = −1} = (1 − y)/2 and 1{y = 1} = (1 + y)/2, because y ∈ {−1, 1}. Consequently,

error_p(φ_{s,+}) = Σ_{i=1}^{m_0(s)} p_i · (1 − y_i)/2 + Σ_{i=m_0(s)+1}^{m} p_i · (1 + y_i)/2
= 1/2 Σ_{i=1}^{m} p_i − 1/2 Σ_{i=1}^{m_0(s)} p_i y_i + 1/2 Σ_{i=m_0(s)+1}^{m} p_i y_i
= 1/2 − 1/2 ( Σ_{i=1}^{m_0(s)} p_i y_i − Σ_{i=m_0(s)+1}^{m} p_i y_i ).

The last equality follows because Σ_{i=1}^{m} p_i = 1.

The case for φ_{s,−} is symmetric to this one, so we omit the argument. 135.
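The identity error_p(φ_{s,+}) = 1/2 − f(m_0(s))/2 can be checked numerically; the points, labels, and distribution below are made-up illustrative values, sorted decreasingly as in the Convention:

```python
import numpy as np

rng = np.random.default_rng(5)
x = -np.sort(-rng.normal(size=6))             # x_1 > x_2 > ... > x_m, distinct
y = rng.choice([-1, 1], size=6)
p = rng.dirichlet(np.ones(6))                 # a distribution on the training set

for m0 in range(7):                           # m0(s) ranges over {0, ..., m}
    s = x[0] + 1.0 if m0 == 0 else x[m0 - 1]  # threshold with x_i >= s iff i <= m0
    pred = np.where(x >= s, 1, -1)            # phi_{s,+}
    err = p[pred != y].sum()                  # weighted training error
    f_m0 = (y[:m0] * p[:m0]).sum() - (y[m0:] * p[m0:]).sum()
    assert abs(err - (0.5 - 0.5 * f_m0)) < 1e-9
```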

slide-137
SLIDE 137

Solution (cont’d)

  • b. For any m_0 ∈ {1, . . . , m} we have

f(m_0) − f(m_0 − 1) = Σ_{i=1}^{m_0} y_i p_i − Σ_{i=m_0+1}^{m} y_i p_i − Σ_{i=1}^{m_0−1} y_i p_i + Σ_{i=m_0}^{m} y_i p_i = 2 y_{m_0} p_{m_0}.

Therefore, |f(m_0) − f(m_0 − 1)| = 2 |y_{m_0}| p_{m_0} = 2 p_{m_0} for all m_0 ∈ {1, . . . , m}.

Because Σ_{i=1}^{m} p_i = 1, there must be at least one index m′_0 with p_{m′_0} ≥ 1/m. Thus we have |f(m′_0) − f(m′_0 − 1)| ≥ 2/m, and so it must be the case that at least one of

|f(m′_0)| ≥ 1/m   or   |f(m′_0 − 1)| ≥ 1/m

holds. Depending on which one of those two inequations is true, we would then “return” m′_0 or m′_0 − 1.

(Note: If |f(m′_0 − 1)| ≥ 1/m and m′_0 = 1, then we have to consider an “outside” threshold, s > x_1.)

Finally, we have γ = 1/(2m). 136.
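An empirical check of part b, sketched with NumPy: for every distribution p tried, some m_0 achieves |f(m_0)| ≥ 1/m = 2γ. Labels are made-up; the distribution varies randomly:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 10
y = rng.choice([-1, 1], size=m)               # labels of x_1 > x_2 > ... > x_m

def f(m0, p):
    # f(m0) = sum_{i<=m0} y_i p_i - sum_{i>m0} y_i p_i
    return (y[:m0] * p[:m0]).sum() - (y[m0:] * p[m0:]).sum()

for _ in range(200):
    p = rng.dirichlet(np.ones(m))             # an arbitrary distribution p
    best = max(abs(f(m0, p)) for m0 in range(m + 1))
    assert best >= 1.0 / m - 1e-12            # |f(m0)| >= 2*gamma = 1/m
```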

slide-138
SLIDE 138

Solution (cont’d)

  • c. The inequation proven at part b, namely |f(m0)| ≥ 2γ implies that either

f(m0) ≥ 2γ or f(m0) ≤ −2γ. So, either f(m0) ≥ 2γ ⇔ −f(m0) ≤ −2γ ⇔ 1 2 − 1 2f(m0)

  • errorp(φs,+)

≤ 1 2 − 1 2 · 2γ = 1 2 − γ

  • r f(m0) ≤ −2γ ⇔ 1

2 + 1 2f(m0)

  • errorp(φs,−)

≤ 1 2 − 1 2 · 2γ = 1 2 − γ for any s ∈ (xm0+1, xm0],a Therefore thresholded decision stumps are guaranteed to have an edge of at least γ = 1 2m over random guessing.

aIn the case described by the Note at part b, we must consider s > x1, that is s ∈ (x1, +∞).

137.

slide-139
SLIDE 139

Summing up

At each iteration t executed by AdaBoost,

  • a probabilistic distribution p (denoted D_t in CMU, 2015, Z. Bar-Joseph, E. Xing, HW4, pr. 2.1-5) is in use;

  • [at part b of the present exercise we proved that] there is at least one m_0 (better denoted m_0(p)) in {0, . . . , m} such that |f(m_0)| ≥ 1/m =: 2γ, where f(m_0) := Σ_{i=1}^{m_0} y_i p_i − Σ_{i=m_0+1}^{m} y_i p_i;

  • [the proof made at parts a and c of the present exercise implies that] for any s ∈ (x_{m_0+1}, x_{m_0}],^a error_p(φ_{s,+}) ≤ 1/2 − γ or error_p(φ_{s,−}) ≤ 1/2 − γ, where γ := 1/(2m).

As a consequence, AdaBoost can choose at each iteration a weak hypothesis (h_t) for which γ_t ≥ γ = 1/(2m).

^a See the previous footnote.

138.

slide-140
SLIDE 140

Solution (cont’d)

  • d. Boosting takes (ln m)/(2γ²) iterations to achieve zero [training] error, as shown at CMU, 2015, Z. Bar-Joseph, E. Xing, HW4, pr. 2.5, so with decision stumps (γ = 1/(2m)) we will achieve zero [training] error in at most 2m² ln m iterations of boosting. Each iteration of boosting introduces a single new weak hypothesis, so at most 2m² ln m thresholded decision stumps are necessary. 139.

slide-141
SLIDE 141

A generalized version of the AdaBoost algorithm

MIT, 2003 fall, Tommy Jaakkola, HW4, pr. 2.1-3

140.

slide-142
SLIDE 142

Here we derive a boosting algorithm from a slightly more general perspective than the AdaBoost algorithm in CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.1-5, one that will be applicable to a class of loss functions including the exponential one. The goal is to generate discriminant functions of the form f_T(x) = α_1 h(x; θ_1) + . . . + α_T h(x; θ_T), where x ∈ R^d, the θ are parameters, and you can assume that the weak classifiers h(x; θ) are decision stumps whose predictions are ±1; any other set of weak learners would also work without modification. We successively add components to the overall discriminant function in a manner that separates, to the extent possible, the estimation of [the parameters of] the weak classifiers from the setting of the votes α. 141.

slide-143
SLIDE 143

A useful definition

Let’s start by defining a set of useful loss functions. The only restriction we place on the loss is that it should be a monotonically decreasing and differentiable function of its argument. The argument in our context is y_i f_T(x_i), so that the more the discriminant function agrees with the ±1 label y_i, the smaller the loss. The simple exponential loss we have already considered [at CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.1-5], i.e., Loss(y_i f_T(x_i)) = exp(−y_i f_T(x_i)), certainly conforms to this notion. And so does the logistic loss Loss(y_i f_T(x_i)) = ln(1 + exp(−y_i f_T(x_i))).

[plot of the logistic loss −log(sigma(z)) and of the sigmoid sigma(z) as functions of z]

142.

slide-144
SLIDE 144

Remark

Note that the logistic loss has a nice interpretation as a negative log-probability. Indeed, [recall that] for an additive logistic regression model

− ln P(y = 1|x, w) = − ln ( 1 / (1 + exp(−z)) ) = ln(1 + exp(−z)),

where z = w_1 φ_1(x) + . . . + w_T φ_T(x) and we omit the bias term (w_0) for simplicity. By replacing the additive combination of basis functions (φ_i(x)) with the combination of weak classifiers (h(x; θ_i)), we obtain an additive logistic regression model where the weak classifiers serve as the basis functions. The difference is that both the basis functions (weak classifiers) and the coefficients multiplying them will be estimated. In the logistic regression model we typically envision a fixed set of basis functions. 143.

slide-145
SLIDE 145

Let us now try to derive the boosting algorithm in a manner that can accommodate any loss function of the type discussed above. To this end, at the current iteration (t) we suppose that we have already included t − 1 component classifiers,

f_{t−1}(x) = α_1 h(x; θ̂_1) + . . . + α_{t−1} h(x; θ̂_{t−1}), (12)

and we wish to add another, h(x; θ). The estimation criterion for the overall discriminant function, including the new component with vote α, is given by

J_t(α, θ) = (1/m) Σ_{i=1}^{m} Loss( y_i f_{t−1}(x_i) + y_i α h(x_i; θ) ).

Note that we explicate only how the objective depends on the choice of the last component and the corresponding vote, since the parameters of the t − 1 previous components along with their votes have already been set and won’t be modified further. 144.

slide-146
SLIDE 146

We will first try to find the new component, or parameters θ, so as to maximize its potential for reducing the empirical loss — potential in the sense that we can subsequently adjust the votes to actually reduce the empirical loss. More precisely, we set θ so as to minimize the derivative

∂/∂α J_t(α, θ)|_{α=0} = (1/m) Σ_{i=1}^{m} ∂/∂α Loss( y_i f_{t−1}(x_i) + y_i α h(x_i; θ) )|_{α=0} = (1/m) Σ_{i=1}^{m} dL( y_i f_{t−1}(x_i) ) y_i h(x_i; θ), (13)

where dL(z) := ∂Loss(z)/∂z. Note that this derivative ∂/∂α J_t(α, θ)|_{α=0} precisely captures the amount by which we would start to reduce the empirical loss if we gradually increased the vote (α) for the new component with parameters θ. Minimizing this derivative seems like a sensible estimation criterion for the new component, i.e., for θ. This plan permits us to first set θ and then subsequently optimize α to actually minimize the empirical loss. 145.

slide-147
SLIDE 147

Let’s rewrite the algorithm slightly to make it look more like a boosting algorithm. First, let’s define the following weights and normalized weights on the training examples:

W_i^{(t−1)} = −dL( y_i f_{t−1}(x_i) )   and   W̃_i^{(t−1)} = W_i^{(t−1)} / Σ_{j=1}^{m} W_j^{(t−1)},   for i = 1, . . . , m.

These weights are guaranteed to be non-negative, since the loss function is a decreasing function of its argument (its derivative has to be negative or zero). 146.

slide-148
SLIDE 148

Now we can rewrite the expression (13) as

∂/∂α J_t(α, θ)|_{α=0} = −(1/m) Σ_{i=1}^{m} W_i^{(t−1)} y_i h(x_i; θ) = −(1/m) ( Σ_j W_j^{(t−1)} ) · Σ_{i=1}^{m} ( W_i^{(t−1)} / Σ_j W_j^{(t−1)} ) y_i h(x_i; θ) = −(1/m) ( Σ_j W_j^{(t−1)} ) · Σ_{i=1}^{m} W̃_i^{(t−1)} y_i h(x_i; θ).

By ignoring the multiplicative constant (i.e., (1/m) Σ_j W_j^{(t−1)}, which is constant at iteration t), we will estimate θ by minimizing

− Σ_{i=1}^{m} W̃_i^{(t−1)} y_i h(x_i; θ), (14)

where the normalized weights W̃_i^{(t−1)} sum to 1. (This is the same as maximizing the weighted agreement with the labels, i.e., Σ_{i=1}^{m} W̃_i^{(t−1)} y_i h(x_i; θ).) 147.

slide-149
SLIDE 149

Some remarks (by Liviu Ciortuz)

  • 1. Using some familiar notations (C for the set of indices classified correctly by h, M for the misclassified ones), we can write

Σ_{i=1}^{m} W̃_i^{(t−1)} y_i h(x_i; θ) = Σ_{i∈C} W̃_i^{(t−1)} y_i h(x_i; θ) + Σ_{i∈M} W̃_i^{(t−1)} y_i h(x_i; θ) = Σ_{i∈C} W̃_i^{(t−1)} − Σ_{i∈M} W̃_i^{(t−1)} = (1 − ε_t) − ε_t = 1 − 2ε_t

⇒ ε_t = 1/2 ( 1 − Σ_{i=1}^{m} W̃_i^{(t−1)} y_i h(x_i; θ̂_t) ).

  • 2. Because W̃_i^{(t−1)} ≥ 0 and Σ_{i=1}^{m} W̃_i^{(t−1)} = 1, it follows that Σ_{i=1}^{m} W̃_i^{(t−1)} y_i h(x_i; θ) ∈ [−1, +1], therefore

1 − Σ_{i=1}^{m} W̃_i^{(t−1)} y_i h(x_i; θ̂_t) ∈ [0, +2], and so ε_t = 1/2 ( 1 − Σ_{i=1}^{m} W̃_i^{(t−1)} y_i h(x_i; θ̂_t) ) ∈ [0, +1].

148.

slide-150
SLIDE 150

We are now ready to cast the steps of the boosting algorithm in a form similar to the AdaBoost algorithm given at CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.1-5. [Assume W̃_i^{(0)} = 1/m and f_0(x_i) = 0 for i = 1, . . . , m.]

Step 1: Find any classifier h(x; θ̂_t) that performs better than chance with respect to the weighted training error:

ε_t = 1/2 ( 1 − Σ_{i=1}^{m} W̃_i^{(t−1)} y_i h(x_i; θ̂_t) ). (15)

Step 2: Set the vote α_t for the new component by minimizing the overall empirical loss:

J_t(α, θ̂_t) = (1/m) Σ_{i=1}^{m} Loss( y_i f_{t−1}(x_i) + y_i α h(x_i; θ̂_t) ), and so α_t = arg min_{α≥0} J_t(α, θ̂_t).

Step 3: Recompute the normalized weights for the next iteration according to

W̃_i^{(t)} = −c_t · dL( y_i f_{t−1}(x_i) + y_i α_t h(x_i; θ̂_t) ) = −c_t · dL( y_i f_t(x_i) ), for i = 1, . . . , m, (16)

where c_t is chosen so that Σ_{i=1}^{m} W̃_i^{(t)} = 1. 149.
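The three steps above can be sketched as a runnable loop. This is a minimal NumPy sketch for the exponential loss Loss(z) = exp(−z) with 1-D threshold stumps; the six points and their labels are made-up (no single stump separates them), and the iteration cap comes from the 2m² ln m bound proven earlier:

```python
import numpy as np

X = np.array([-2.0, -1.5, -0.5, 0.5, 1.5, 2.5])   # illustrative 1-D instances
y = np.array([1, 1, -1, -1, 1, 1])                # consistent labels, |x| > 1

dL = lambda z: -np.exp(-z)                        # derivative of the exponential loss
f = np.zeros_like(X)                              # f_0 = 0
W = np.ones_like(X) / len(X)                      # normalized weights, W~^(0)

for t in range(200):                              # 2*m^2*ln(m) ~ 129 suffices here
    # Step 1: stump maximizing the weighted agreement sum_i W_i y_i h(x_i)
    best_score, best_h = -np.inf, None
    for s in np.append(X, X.max() + 1.0):         # all m+1 distinct partitions
        for sign in (1, -1):
            h = sign * np.where(X >= s, 1, -1)
            score = (W * y * h).sum()
            if score > best_score:
                best_score, best_h = score, h
    h = best_h
    eps = max(0.5 * (1.0 - best_score), 1e-12)    # eq. (15), with a numerical floor
    # Step 2: the closed-form vote for the exponential loss
    alpha = 0.5 * np.log((1 - eps) / eps)
    f = f + alpha * h
    if (np.sign(f) == y).all():                   # zero training error: stop early
        break
    # Step 3: renormalized weights, W~_i^(t) = -c_t * dL(y_i f_t(x_i))
    W = -dL(y * f)
    W = W / W.sum()

assert (np.sign(f) == y).all()                    # guaranteed by the stump edge 1/(2m)
```

The closed-form α in Step 2 is specific to the exponential loss; for other losses Step 2 would be a 1-D numerical minimization.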

slide-151
SLIDE 151

One more remark (by Liviu Ciortuz), now concerning Step 1:

Normally there should be such an ε_t ∈ (0, 1/2) (in fact, some corresponding θ̂_t), because if for some h we had ε_t ∈ (1/2, 1), then we could take h′ = −h, and the resulting ε′_t would belong to (0, 1/2).

There are only two exceptions, which correspond to the case when for any hypothesis h we would have
− either ε_t = 1/2, in which case Σ_{i∈C} W̃_i^{(t−1)} = Σ_{i∈M} W̃_i^{(t−1)},
− or ε_t ∈ {0, 1}, in which case either h or h′ = −h is a perfect (therefore not weak) classifier for the given training data. 150.

slide-152
SLIDE 152

Exemplifying Step 1 on data from CMU, 2012 fall, T. Mitchell, Z. Bar-Joseph, final, pr. 8.a-e

[graphs made by MSc student Sebastian Ciobanu, FII, 2018 fall] Iteration 1

[two plots of ∂J/∂α at α = 0 versus the stump threshold theta, one for each stump orientation, (+|−) and (−|+)]

151.

slide-153
SLIDE 153

[graphs made by MSc student Sebastian Ciobanu, FII, 2018 fall] Iteration 2

[two plots of ∂J/∂α at α = 0 versus the stump threshold theta, one for each stump orientation, (+|−) and (−|+)]

152.

slide-154
SLIDE 154

[graphs made by MSc student Sebastian Ciobanu, FII, 2018 fall] Iteration 3

[two plots of ∂J/∂α at α = 0 versus the stump threshold theta, one for each stump orientation, (+|−) and (−|+)]

153.

slide-155
SLIDE 155
  • a. Show that the three steps in the algorithm correspond exactly to AdaBoost when the loss function is the exponential loss, Loss(z) = exp(−z). More precisely, show that in this case the setting of α_t based on the new weak classifier and the weight update to get W̃_i^{(t)} would be identical to AdaBoost. (In CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.1-5, W̃_i^{(t)} corresponds to D_{t+1}(i).)

Solution

For the first part, we will show that the minimization in Step 2 of the general algorithm (LHS below), with Loss(z) = e^{−z}, is the same as the minimization performed by AdaBoost (RHS below), i.e., that

arg min_{α>0} Σ_{i=1}^{m} Loss( y_i f_{t−1}(x_i) + α y_i h(x_i; θ̂_t) ) = arg min_{α>0} Σ_{i=1}^{m} W̃_i^{(t−1)} exp( −α y_i h(x_i; θ̂_t) ),

with (from AdaBoost) W̃_i^{(t−1)} = c_{t−1} · exp( −y_i f_{t−1}(x_i) ), where c_{t−1} is a normalization constant (the weights sum to 1). 154.

slide-156
SLIDE 156

Solution (cont’d)

Evaluating the objective in the LHS gives:

Σ_{i=1}^{m} Loss( y_i f_{t−1}(x_i) + α y_i h(x_i; θ̂_t) ) = Σ_{i=1}^{m} exp( −y_i f_{t−1}(x_i) ) exp( −α y_i h(x_i; θ̂_t) )
= (1/c_{t−1}) Σ_{i=1}^{m} W̃_i^{(t−1)} exp( −α y_i h(x_i; θ̂_t) )   [by (16)]
= (1/c_{t−1}) [ ( Σ_{i: y_i h(x_i; θ̂_t) = 1} W̃_i^{(t−1)} ) e^{−α} + ( Σ_{i: y_i h(x_i; θ̂_t) = −1} W̃_i^{(t−1)} ) e^{α} ],

which is proportional to the objective minimized by AdaBoost (see CMU, 2008 fall, Eric Xing, HW3, pr. 4.1.1). Therefore, minimizing with respect to α yields the same value of α for both algorithms.

For the second part, note that the weight assignment in Step 3 of the general algorithm (for stage t) is

W̃_i^{(t)} = −c_t · dL( y_i f_t(x_i) ) = c_t · exp( −y_i f_t(x_i) ),

which is the same as in AdaBoost (see CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.2). 155.
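A quick numeric confirmation of this equivalence: for the exponential loss, a grid minimization of J(α) = Σ_i W̃_i exp(−α y_i h(x_i)) lands on AdaBoost's closed form α = ½ ln((1 − ε)/ε). Labels, predictions, and weights below are toy values:

```python
import numpy as np

y = np.array([1, -1, 1, 1, -1, 1, -1, 1])
h = np.array([1, -1, -1, 1, -1, 1, 1, 1])     # disagrees with y at two indices
W = np.full(8, 1.0 / 8)                       # uniform normalized weights

eps = W[y != h].sum()                         # weighted error, here 0.25
alpha_ada = 0.5 * np.log((1 - eps) / eps)     # AdaBoost's closed-form vote

grid = np.linspace(-2.0, 2.0, 400001)
J = np.exp(-np.outer(grid, y * h)) @ W        # J(alpha) evaluated on the grid
alpha_num = grid[np.argmin(J)]
assert abs(alpha_num - alpha_ada) < 1e-3
```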

slide-157
SLIDE 157
  • b. Show that, for any valid loss function of the type discussed above, the new component h(x; θ̂_t) just added at the t-th iteration has weighted training error exactly 1/2 relative to the updated weights W̃_i^{(t)}.

Solution

At stage t, α_t is chosen to minimize J_t(α, θ̂_t), i.e., to solve ∂J_t(α, θ̂_t)/∂α = 0. In general,

∂/∂α J_t(α, θ̂_t) = (1/m) Σ_{i=1}^{m} ∂/∂α Loss( y_i f_{t−1}(x_i) + y_i α h(x_i; θ̂_t) ) = (1/m) Σ_{i=1}^{m} dL( y_i f_{t−1}(x_i) + y_i α h(x_i; θ̂_t) ) y_i h(x_i; θ̂_t),

and at α = α_t the factor dL(·) equals −W̃_i^{(t)}/c_t, so the derivative is proportional to Σ_{i=1}^{m} W̃_i^{(t)} y_i h(x_i; θ̂_t); hence we must have

Σ_{i=1}^{m} W̃_i^{(t)} y_i h(x_i; θ̂_t) = 0.

Then the weighted training error of h(x; θ̂_t) (relative to the updated weights W̃_i^{(t)} determined by α_t) can be computed in a similar way to (15):

1/2 ( 1 − Σ_{i=1}^{m} W̃_i^{(t)} y_i h(x_i; θ̂_t) ) = 1/2 (1 − 0) = 1/2. 156.
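This can be observed directly for the exponential loss: after the optimal vote α_t, the stump's error under the updated weights is exactly 1/2. Toy data, with f_{t−1} = 0 for simplicity:

```python
import numpy as np

y = np.array([1, -1, 1, 1, -1, 1, -1, 1])
h = np.array([1, -1, -1, 1, -1, 1, 1, 1])     # eps = 2/8 under uniform weights
eps = 0.25
alpha = 0.5 * np.log((1 - eps) / eps)         # the minimizing vote

W = np.exp(-y * alpha * h)                    # unnormalized updated weights
W = W / W.sum()                               # W~^(t)
err_new = W[y != h].sum()                     # weighted error of h under W~^(t)
assert abs(err_new - 0.5) < 1e-12
```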

slide-158
SLIDE 158

CMU, 2008 fall, Eric Xing, HW3, pr. 4.1.1 / CMU, 2008 fall, Eric Xing, midterm, pr. 5.1

  • c. Now suppose that we change the objective function to J_t = Σ_{i=1}^{m} (y_i − f_t(x_i))² and we still want to optimize it sequentially.^a What is the new update rule for α_t?

Solution

We will compute the derivative of J_t = Σ_{i=1}^{m} (y_i − f_t(x_i))² with respect to α_t and set it to zero to find the value of α_t:

∂J_t/∂α_t = Σ_{i=1}^{m} 2 (y_i − f_t(x_i)) ∂(y_i − f_t(x_i))/∂α_t.

We also know that f_t(x_i) = f_{t−1}(x_i) + α_t h_t(x_i), where f_{t−1} is independent of α_t. Substituting this in the derivative equation, we get

∂J_t/∂α_t = 2 Σ_{i=1}^{m} (y_i − f_t(x_i)) ∂(y_i − f_{t−1}(x_i) − α_t h_t(x_i))/∂α_t = 2 Σ_{i=1}^{m} (y_i − f_t(x_i)) (−h_t(x_i)).

^a LC: Note that (y_i − f_t(x_i))² = [y_i (1 − y_i f_t(x_i))]² = (1 − y_i f_t(x_i))² = (1 − z_i)², where z_i = y_i f_t(x_i). The function (1 − z)² is differentiable and convex; it is decreasing on (−∞, 1] and increasing on [1, +∞).

157.

slide-159
SLIDE 159

Solution (cont’d)

Setting the derivative to zero, we get

∂J_t/∂α_t = 0 ⇔ Σ_{i=1}^{m} (y_i − f_t(x_i)) h_t(x_i) = 0 ⇔ Σ_{i=1}^{m} (y_i − α_t h_t(x_i) − f_{t−1}(x_i)) h_t(x_i) = 0
⇔ Σ_{i=1}^{m} (y_i − f_{t−1}(x_i)) h_t(x_i) = α_t Σ_{i=1}^{m} h_t²(x_i) ⇔ α_t = Σ_{i=1}^{m} (y_i − f_{t−1}(x_i)) h_t(x_i) / Σ_{i=1}^{m} h_t²(x_i)
⇔ α_t = (1/m) Σ_{i=1}^{m} (y_i − f_{t−1}(x_i)) h_t(x_i),

since h_t²(x_i) = 1 for all i, and therefore Σ_{i=1}^{m} h_t²(x_i) = m. 158.
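The closed form can be checked against a brute-force minimization of the squared-loss objective; the residuals and stump predictions below are illustrative values:

```python
import numpy as np

y = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
f_prev = np.array([0.4, -0.2, -0.1, 0.3, 0.6])     # f_{t-1}(x_i), toy values
h = np.array([1, -1, -1, -1, 1])                   # h_t(x_i) in {-1, +1}

alpha_closed = ((y - f_prev) * h).mean()           # (1/m) sum (y_i - f_{t-1}) h_t

grid = np.linspace(-2.0, 2.0, 400001)
J = ((y - (f_prev[None, :] + grid[:, None] * h)) ** 2).sum(axis=1)
alpha_num = grid[np.argmin(J)]
assert abs(alpha_num - alpha_closed) < 1e-4
```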

slide-160
SLIDE 160

MIT, 2006 fall, Tommi Jaakkola, HW4, pr. 3.a / MIT, 2009 fall, Tommi Jaakkola, HW3, pr. 2.1

  • d. Show that if we use the logistic loss instead [of the exponential loss], the unnormalized weights W_i^{(t)} are bounded by 1.

Solution

W_i^{(t)} was defined as −dL( y_i f_t(x_i) ), with dL(z) := ∂/∂z ln(1 + e^{−z}). Therefore,

W_i^{(t)} = e^{−z_i} / (1 + e^{−z_i}) = 1 / (1 + e^{z_i}) < 1, where z_i = y_i f_t(x_i). 159.
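The bound is easy to probe numerically: the logistic weight 1/(1 + e^z) stays strictly inside (0, 1) over a wide range of agreements z:

```python
import numpy as np

# z = y_i f_t(x_i); the logistic-loss weight W_i^(t) = 1 / (1 + exp(z)).
z = np.linspace(-30.0, 30.0, 1001)
W = 1.0 / (1.0 + np.exp(z))
assert np.all((W > 0) & (W < 1))              # bounded by 1, never exactly 0 or 1
```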

slide-161
SLIDE 161
  • e. When using the logistic loss, what are the normalized weights W̃_i^{(t)}? Express the weights as a function of the agreements y_i f_t(x_i), where we have already included the t-th weak learner. What can you say about the resulting normalized weights for examples that are clearly misclassified, in comparison to those that are just slightly misclassified by the current ensemble? If the training data contains mislabeled examples, why do we prefer the logistic loss over the exponential loss, Loss(z) = exp(−z)?

Solution

The normalized weights are given by

W̃_i^{(t)} = c_t · exp(−y_i f_t(x_i)) / (1 + exp(−y_i f_t(x_i))), with the normalization constant c_t = ( Σ_{i=1}^{m} exp(−y_i f_t(x_i)) / (1 + exp(−y_i f_t(x_i))) )^{−1}.

[Answer from MIT, 2011 fall, Leslie P. Kaelbling, HW5, pr. 1.1] For clearly misclassified examples, y_i f_t(x_i) is a large negative number, so W_i^{(t)} is close to [and less than] 1, while for slightly misclassified examples, W_i^{(t)} is close to [and greater than] 1/2. Thus, the normalized weights for the two respective cases will be in a ratio of at most 2 : 1, i.e., a single clearly misclassified outlier will never be worth more than two completely uncertain points. This is why boosting with the logistic loss function is robust to outliers. 160.
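The 2 : 1 bound can be seen directly on two extreme example values: an outlier with a very negative agreement versus a point right at the decision boundary:

```python
import numpy as np

# Unnormalized logistic weight as a function of the agreement z = y_i f_t(x_i).
w = lambda z: np.exp(-z) / (1.0 + np.exp(-z))

ratio = w(-10.0) / w(-1e-6)                   # clearly vs. barely misclassified
assert 1.0 < ratio < 2.0                      # the outlier weighs < 2 boundary points
```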

slide-162
SLIDE 162

Solution (cont’d) [translated from Romanian]

LC: For the last part of point e I found no answer at MIT, but I can reason as follows: in the case of the logistic loss function, if we have an x_i whose correct label would be y_i = +1 but which is (erroneously) taken to have y_i = −1, the loss is approximately f_t(x_i) when f_t(x_i) > 0,^a whereas in the case of the [negative] exponential loss function the loss is exp(f_t(x_i)), which is in general much larger than f_t(x_i). The symmetric cases (f_t(x_i) ≤ 0, and then y_i = −1 → +1) are treated similarly.

^a See the graph of the logistic loss function in the statement of the present problem.

161.

slide-163
SLIDE 163
  • L. Ciortuz, 2020
  • f. Suppose we use the logistic loss. What is the update rule for α_t?

Solution [translated from Romanian]; initially written by Ștefan Matcovici (MSc student)

Instead of minimizing J_t with respect to the argument α, we will try to minimize with respect to α an upper bound on the difference between J_t and J*_{t−1}, where J*_{t−1} := min_{h∈H, α∈R₊} J_{t−1}(h, α). (Unlike J_t, which depends on α, J*_{t−1} does not depend on α. Therefore, minimizing J_t with respect to α is equivalent to minimizing J_t − J*_{t−1} with respect to α.) 162.

slide-164
SLIDE 164

J_t − J*_{t−1} = Σ_{i=1}^{m} ln(1 + exp(−y_i f_t(x_i))) − Σ_{i=1}^{m} ln(1 + exp(−y_i f_{t−1}(x_i)))

= Σ_{i=1}^{m} ln [ (1 + exp(−y_i f_t(x_i))) / (1 + exp(−y_i f_{t−1}(x_i))) ]

= Σ_{i=1}^{m} ln [ 1 + (exp(−y_i f_t(x_i)) − exp(−y_i f_{t−1}(x_i))) / (1 + exp(−y_i f_{t−1}(x_i))) ]

= Σ_{i=1}^{m} ln [ 1 + (exp(−y_i f_{t−1}(x_i) − y_i α h(x_i)) − exp(−y_i f_{t−1}(x_i))) / (1 + exp(−y_i f_{t−1}(x_i))) ]

= Σ_{i=1}^{m} ln [ 1 + exp(−y_i f_{t−1}(x_i)) (exp(−y_i α h(x_i)) − 1) / (1 + exp(−y_i f_{t−1}(x_i))) ]

= Σ_{i=1}^{m} ln [ 1 + (exp(−y_i α h(x_i)) − 1) / (exp(y_i f_{t−1}(x_i)) + 1) ] (17)

≤ Σ_{i=1}^{m} (exp(−y_i α h(x_i)) − 1) / (exp(y_i f_{t−1}(x_i)) + 1), because ln(1 + z) ≤ z, ∀z > −1. (18)

163.
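The algebraic identity behind (17) can be verified numerically on arbitrary test values (labels and predictions in {−1, +1}, the rest drawn at random):

```python
import numpy as np

rng = np.random.default_rng(4)
for _ in range(1000):
    y = rng.choice([-1, 1]); a = rng.normal(); h = rng.choice([-1, 1])
    f_prev = rng.normal()
    f_t = f_prev + a * h                       # f_t = f_{t-1} + alpha * h
    # Per-example change of the logistic loss ...
    lhs = np.log(1 + np.exp(-y * f_t)) - np.log(1 + np.exp(-y * f_prev))
    # ... equals the per-example term inside (17).
    rhs = np.log(1 + (np.exp(-y * a * h) - 1) / (np.exp(y * f_prev) + 1))
    assert abs(lhs - rhs) < 1e-10
```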

slide-165
SLIDE 165

Remarcat ¸i faptul c˘ a logaritmul din expresia (17) exist˘ a, ˆ ıntrucˆ at exp(−yiαh(xi)) − 1 exp(yift−1(xi)) + 1 > −1 ⇔ exp(−yiαh(xi)) − 1 > − exp(yift−1(xi)) − 1 ⇔ exp(−yiαh(xi))

  • >0

> − exp(yift−1(xi))

  • <0

, inegalitate adev˘ arat˘ a pentru ∀α. Expresia (18) pe care tocmai am obt ¸inut-o mai sus este marginea superioar˘ a (engl., upper bound) pentru Jt − Jt−1 pe care o vom minimiza ˆ ın raport cu α.

164.

slide-166
SLIDE 166

A simple computation shows that

∂/∂z ln(1 + e^{−z}) = −e^{−z} / (1 + e^{−z}) = −1 / (e^{z} + 1). (19)

We know from the statement that W_i^{(t−1)} = −dL( y_i f_{t−1}(x_i) ). According to relation (19), it follows that

W_i^{(t−1)} = 1 / (exp(y_i f_{t−1}(x_i)) + 1), and W̃_i^{(t−1)} = c_{t−1} · 1 / (exp(y_i f_{t−1}(x_i)) + 1),

where c_{t−1} is the normalization constant of the weights W_i^{(t−1)}.

165.

slide-167
SLIDE 167

Therefore,

J_t − J*_{t−1} ≤ Σ_{i=1}^{m} [ 1 / (exp(y_i f_{t−1}(x_i)) + 1) ] (exp(−y_i α h(x_i)) − 1)
= (1/c_{t−1}) Σ_{i=1}^{m} W̃_i^{(t−1)} · (exp(−y_i α h(x_i)) − 1)
∝ Σ_{i=1}^{m} W̃_i^{(t−1)} · exp(−y_i α h(x_i)) − 1,

since the normalized weights W̃_i^{(t−1)} sum to 1. Hence, minimizing the upper bound on J_t − J*_{t−1} amounts to minimizing (with respect to α) the expression Σ_{i=1}^{m} W̃_i^{(t−1)} · exp(−y_i α h(x_i)), which we also encountered when optimizing the [negative] exponential loss. We can conclude that in the case of the logistic loss function, too, we can choose α_t = 1/2 ln((1 − ε_t)/ε_t).

166.

slide-168
SLIDE 168

MIT, 2006 fall, Tommi Jaakkola, HW4, pr. 3.b

g. Suppose again that we use the logistic loss and that the training set is linearly separable. We would like to use a linear support vector machine (no slack penalties) as a base classifier [LC: or any linear separator consistent with the training data]. Assume that the generalized AdaBoost algorithm minimizes the weighted error ε_t at Step 1. In the first boosting iteration, what would the resulting α_1 be? 167.

slide-169
SLIDE 169

Solution

In Step 1, we pick θ_1. We wish to find θ_1 to minimize ∂J_t(α, θ)/∂α |_{α=0}. Equivalently, this θ_1 is chosen to minimize the weighted sum 2ε_1 − 1 = −Σ_{i=1}^{m} W̃_0(i) y_i h(x_i; θ), where W̃_0(i) = 1/m for all i = 1, 2, . . . , m. If the training set is linearly separable with offset, then the no-slack SVM problem is feasible. Hence, the base classifier in this case will be an affine (linear with offset) separator h(·; θ_1), which satisfies the inequality y_i h(x_i; θ_1) ≥ 1 for all i = 1, 2, . . . , m.

In Step 2, we pick α_1 to minimize J_t(α_1, θ_1) = (1/m) Σ_{i=1}^{m} L(y_i f_0(x_i) + α_1 y_i h(x_i; θ_1)) = (1/m) Σ_{i=1}^{m} L(α_1 y_i h(x_i; θ_1)). Note that J_t(α_1, θ_1) is a sum of terms that are strictly decreasing in α_1 (as y_i h(x_i; θ_1) ≥ 1); therefore, it itself is also strictly decreasing in α_1. It follows that the boosting algorithm with the logistic loss will take α_1 = ∞ in order to minimize J_t(α_1, θ_1).

This makes sense, because if we can find a base classifier that perfectly separates the data, we will weight it as much as we can to minimize the boosting loss. The lesson here is simple: when doing boosting, we need to use base classifiers that are not powerful enough to perfectly separate the data. 168.