

  1. HW1 • Grades are out • Total: 180 • Min: 55 • Max: 188 (178 + 10 for bonus credit) • Average: 174.24 • Median: 178 • Std: 18.225

  2. Top 5 on HW1: 1. Curtis, Josh (score: 188, test accuracy: 0.9598) 2. Huang, Waylon (score: 180, test accuracy: 0.8202) 3. Luckey, Royden (score: 180, test accuracy: 0.8192) 4. Luo, Mathew Han (score: 180, test accuracy: 0.8174) 5. Shen, Dawei (score: 180, test accuracy: 0.8130)

  3. CSE446: Ensemble Learning - Bagging and Boosting Spring 2017 Ali Farhadi Slides adapted from Carlos Guestrin, Nick Kushmerick, Padraig Cunningham, and Luke Zettlemoyer

  4. [figure-only slide]

  5. [figure-only slide]

  6. Voting (Ensemble Methods) • Instead of learning a single classifier, learn many weak classifiers that are good at different parts of the data • Output class: (weighted) vote of each classifier – Classifiers that are most “sure” will vote with more conviction – Classifiers will be most “sure” about a particular part of the space – On average, do better than a single classifier! • But how??? – Force classifiers to learn about different parts of the input space? Different subsets of the data? – Weigh the votes of different classifiers? (A minimal weighted-voting sketch follows below.)
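In case it helps make the weighted-vote idea concrete, here is a minimal sketch; the function names and the ±1 label coding are illustrative, not from the slides:

```python
def weighted_vote(classifiers, weights, x):
    """Each h in `classifiers` returns -1 or +1; a larger weight means a more 'sure' voter."""
    score = sum(w * h(x) for h, w in zip(classifiers, weights))
    return 1 if score >= 0 else -1

# Example: two classifiers disagree; the more confident one wins the vote.
h1 = lambda x: 1 if x > 0 else -1
h2 = lambda x: -1
print(weighted_vote([h1, h2], [0.8, 0.3], x=2.0))   # -> 1
```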

  7. BAGGing = Bootstrap AGGregation (Breiman, 1996) • For i = 1, 2, …, K: – T_i ← randomly select M training instances with replacement – h_i ← learn(T_i) [Decision Tree, Naive Bayes, …] • Now combine the h_i with uniform voting (w_i = 1/K for all i). A sketch of this procedure is given below.
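A minimal sketch of the BAGGing procedure above, assuming scikit-learn decision trees as the base learner and ±1 labels; K, M, X, y are placeholders:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, K=25, M=None, seed=0):
    """Learn K trees, each on M training instances drawn with replacement (a bootstrap sample T_i)."""
    rng = np.random.default_rng(seed)
    M = len(X) if M is None else M
    models = []
    for _ in range(K):
        idx = rng.integers(0, len(X), size=M)          # T_i: sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Uniform voting (w_i = 1/K): average the +/-1 predictions of the h_i and take the sign."""
    return np.sign(np.mean([h.predict(X) for h in models], axis=0))
```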

  8. [figure-only slide]

  9. [figure: decision tree learning algorithm; very similar to the version in earlier slides]

  10. [figure: shades of blue/red indicate the strength of the vote for a particular classification]

  11. Fighting the bias-variance tradeoff • Simple (a.k.a. weak) learners are good – e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees) – Low variance, don’t usually overfit • Simple (a.k.a. weak) learners are bad – High bias, can’t solve hard learning problems • Can we make weak learners always good??? – No!!! – But often yes…

  12. Boosting [Schapire, 1989] • Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote • On each iteration t: – weight each training example by how incorrectly it was classified – learn a hypothesis h_t – and a strength for this hypothesis, α_t • Final classifier: H(x) = sign(Σ_t α_t h_t(x)) • Practically useful • Theoretically interesting

  13. [figure] time = 0; blue/red = class; size of dot = weight; weak learner = decision stump: a horizontal or vertical line

  14. [figure] time = 1; this hypothesis has 15% error, and so does the ensemble, since the ensemble contains just this one hypothesis

  15. [figure] time = 2

  16. [figure] time = 3

  17. [figure] time = 13

  18. [figure] time = 100

  19. [figure] time = 300: overfitting!

  20. Learning from weighted data • Consider a weighted dataset – D(i) = weight of the i-th training example (x_i, y_i) – Interpretations: • the i-th training example counts as if it occurred D(i) times • if I were to “resample” the data, I would get more samples of “heavier” data points • Now, always do weighted calculations – e.g., for the MLE in Naïve Bayes, redefine Count(Y=y) to be the weighted count Count_D(Y=y) = Σ_i D(i) δ(y_i = y) – setting D(j) = 1 (or any constant value!) for all j recreates the unweighted case (see the sketch below)
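A minimal sketch of the weighted-count idea for a Naïve Bayes-style MLE; the function names are illustrative:

```python
import numpy as np

def weighted_count(y, D, label):
    """Weighted Count(Y = label): sum the weights D(i) of examples whose label matches."""
    return np.sum(D * (y == label))

def weighted_class_prior(y, D, label):
    """Weighted MLE of P(Y = label)."""
    return weighted_count(y, D, label) / np.sum(D)

y = np.array([+1, -1, +1])
D = np.ones(len(y)) / len(y)                # constant D(j) recreates the unweighted case
print(weighted_class_prior(y, D, +1))       # 0.666...
```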

  21. Given: training set {(x_1, y_1), …, (x_m, y_m)} with y_i ∈ {−1, +1}. Initialize: D_1(i) = 1/m. For t=1…T: • Train base classifier h_t(x) using D_t • Choose α_t (how? many possibilities; we will see one shortly!) • Update, for i=1..m: D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t, with normalization constant Z_t (why? this reweights the data: examples i that are misclassified will have higher weights!) • Output final classifier: H(x) = sign(Σ_t α_t h_t(x)) • Notes: y_i h_t(x_i) > 0 ⇔ h_t correct; y_i h_t(x_i) < 0 ⇔ h_t wrong; h_t correct and α_t > 0 ⇒ D_{t+1}(i) < D_t(i); h_t wrong and α_t > 0 ⇒ D_{t+1}(i) > D_t(i) • Final result: a linear sum of “base” or “weak” classifier outputs. (A runnable sketch of this loop follows below.)
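Here is a runnable sketch of the loop on this slide (the standard AdaBoost update), assuming ±1 labels and, purely as an illustrative base learner, scikit-learn decision stumps; the clipping of ε_t is an implementation convenience, not something the slide specifies:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Boost decision stumps for T rounds; y must be coded as -1/+1."""
    m = len(X)
    D = np.ones(m) / m                                 # D_1(i) = 1/m
    hs, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.sum(D * (pred != y))                  # error of h_t, weighted by D_t
        eps = np.clip(eps, 1e-10, 1 - 1e-10)           # avoid alpha = +/- infinity
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)              # misclassified points get higher weight
        D /= D.sum()                                   # divide by the normalization constant Z_t
        hs.append(h)
        alphas.append(alpha)
    return hs, alphas

def adaboost_predict(hs, alphas, X):
    """Final classifier: sign of the alpha-weighted (linear) sum of base classifier outputs."""
    return np.sign(sum(a * h.predict(X) for h, a in zip(hs, alphas)))
```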

  22. Given: training set. Initialize: D_1(i) = 1/m. For t=1…T: • Train base classifier h_t(x) using D_t • Choose α_t = (1/2) ln((1 − ε_t)/ε_t), where ε_t is the error of h_t weighted by D_t (0 ≤ ε_t ≤ 1) • Update, for i=1..m: D_{t+1}(i) ∝ D_t(i) exp(−α_t y_i h_t(x_i)) • No errors: ε_t = 0 ⇒ α_t = ∞ • All errors: ε_t = 1 ⇒ α_t = −∞ • Random: ε_t = 0.5 ⇒ α_t = 0 • [figure: α_t as a function of ε_t] (a small numeric check follows below)
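A small numeric check of how α_t behaves at the endpoints mentioned on the slide:

```python
import numpy as np

def alpha(eps):
    """alpha_t = 0.5 * ln((1 - eps_t) / eps_t)."""
    return 0.5 * np.log((1 - eps) / eps)

for eps in [0.01, 0.25, 0.5, 0.75, 0.99]:
    print(f"eps_t = {eps:.2f}  ->  alpha_t = {alpha(eps):+.2f}")
# eps_t -> 0 gives alpha_t -> +inf, eps_t -> 1 gives alpha_t -> -inf, eps_t = 0.5 gives 0
```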

  23. What α_t to choose for hypothesis h_t? [Schapire, 1989] Idea: choose α_t to minimize a bound on the training error: (1/m) Σ_i δ(H(x_i) ≠ y_i) ≤ (1/m) Σ_i exp(−y_i f(x_i)), where f(x) = Σ_t α_t h_t(x) and H(x) = sign(f(x)).

  24. What α_t to choose for hypothesis h_t? [Schapire, 1989] Idea: choose α_t to minimize a bound on the training error: (1/m) Σ_i δ(H(x_i) ≠ y_i) ≤ (1/m) Σ_i exp(−y_i f(x_i)) = ∏_t Z_t (spelled out below). This equality isn’t obvious! It can be shown with algebra (telescoping sums). If we minimize ∏_t Z_t, we minimize our training error!!! • We can tighten this bound greedily, by choosing α_t and h_t on each iteration to minimize Z_t • h_t is estimated as a black box, but can we solve for α_t?
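Spelled out in the notation of the algorithm slides (D_1(i) = 1/m, f(x) = Σ_t α_t h_t(x), H(x) = sign(f(x))), the bound the slide refers to is:

```latex
\frac{1}{m}\sum_{i=1}^{m} \mathbb{1}\big[H(x_i)\neq y_i\big]
\;\le\; \frac{1}{m}\sum_{i=1}^{m} \exp\big(-y_i f(x_i)\big)
\;=\; \prod_{t=1}^{T} Z_t,
\qquad
Z_t \;=\; \sum_{i=1}^{m} D_t(i)\,\exp\big(-\alpha_t\, y_i\, h_t(x_i)\big).
```

The inequality holds because exp(−y_i f(x_i)) ≥ 1 whenever H misclassifies x_i; the equality is the telescoping-sums step the slide mentions.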

  25. Summary: choose α_t to minimize the error bound [Schapire, 1989]. We can squeeze this bound by choosing α_t on each iteration to minimize Z_t. For boolean Y: differentiate, set equal to 0, and there is a closed-form solution [Freund & Schapire ’97]: α_t = (1/2) ln((1 − ε_t)/ε_t). (A short derivation is sketched below.)
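A brief version of that derivation, under the standard setup (y_i, h_t(x_i) ∈ {−1, +1} and ε_t the D_t-weighted error of h_t):

```latex
Z_t \;=\; \sum_i D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}
     \;=\; (1-\epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t},
\qquad
\frac{\partial Z_t}{\partial \alpha_t}
     \;=\; -(1-\epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t} \;=\; 0
\;\;\Longrightarrow\;\;
\alpha_t \;=\; \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}.
```

Plugging this α_t back in gives Z_t = 2√(ε_t(1 − ε_t)), which is what the later bound uses.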

  26. Given: training set {(x_1, y_1), …, (x_m, y_m)} with y_i ∈ {−1, +1}. Initialize: D_1(i) = 1/m. For t=1…T: • Train base classifier h_t(x) using D_t • Choose α_t = (1/2) ln((1 − ε_t)/ε_t) • Update, for i=1..m: D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t, with normalization constant Z_t = Σ_i D_t(i) exp(−α_t y_i h_t(x_i)) • Output final classifier: H(x) = sign(Σ_t α_t h_t(x))

  27. Example: boosting with decision stumps as the base classifier
  • Data (x_1, y): (−1, +1), (0, −1), (+1, +1)
  • Initial: D_1 = [D_1(1), D_1(2), D_1(3)] = [0.33, 0.33, 0.33]
  • t=1: Train stump [work omitted, breaking ties randomly]: h_1(x) = +1 if x_1 > 0.5, −1 otherwise
  • ε_1 = Σ_i D_1(i) δ(h_1(x_i) ≠ y_i) = 0.33×1 + 0.33×0 + 0.33×0 = 0.33
  • α_1 = (1/2) ln((1 − ε_1)/ε_1) = 0.5 × ln(2) = 0.35
  • D_2(1) ∝ D_1(1) × exp(−α_1 y_1 h_1(x_1)) = 0.33 × exp(−0.35 × 1 × −1) = 0.33 × exp(0.35) = 0.46
  • D_2(2) ∝ D_1(2) × exp(−α_1 y_2 h_1(x_2)) = 0.33 × exp(−0.35 × −1 × −1) = 0.33 × exp(−0.35) = 0.23
  • D_2(3) ∝ D_1(3) × exp(−α_1 y_3 h_1(x_3)) = 0.33 × exp(−0.35 × 1 × 1) = 0.33 × exp(−0.35) = 0.23
  • After normalizing: D_2 = [D_2(1), D_2(2), D_2(3)] = [0.5, 0.25, 0.25]
  • Ensemble so far: H(x) = sign(0.35 × h_1(x))
  • t=2 continues on the next slide!

  28. t=2, with D_2 = [D_2(1), D_2(2), D_2(3)] = [0.5, 0.25, 0.25]
  • Train stump [work omitted; a different stump because of the new data weights D; breaking ties opportunistically (will discuss at the end)]: h_2(x) = +1 if x_1 < 1.5, −1 otherwise
  • ε_2 = Σ_i D_2(i) δ(h_2(x_i) ≠ y_i) = 0.5×0 + 0.25×1 + 0.25×0 = 0.25
  • α_2 = (1/2) ln((1 − ε_2)/ε_2) = 0.5 × ln(3) = 0.55
  • D_3(1) ∝ D_2(1) × exp(−α_2 y_1 h_2(x_1)) = 0.5 × exp(−0.55 × 1 × 1) = 0.5 × exp(−0.55) = 0.29
  • D_3(2) ∝ D_2(2) × exp(−α_2 y_2 h_2(x_2)) = 0.25 × exp(−0.55 × −1 × 1) = 0.25 × exp(0.55) = 0.43
  • D_3(3) ∝ D_2(3) × exp(−α_2 y_3 h_2(x_3)) = 0.25 × exp(−0.55 × 1 × 1) = 0.25 × exp(−0.55) = 0.14
  • After normalizing: D_3 = [D_3(1), D_3(2), D_3(3)] = [0.33, 0.5, 0.17]
  • Ensemble so far: H(x) = sign(0.35 × h_1(x) + 0.55 × h_2(x)), with h_1(x) = +1 if x_1 > 0.5 and h_2(x) = +1 if x_1 < 1.5 (−1 otherwise)
  • t=3 continues on the next slide!

  29. t=3, with D_3 = [D_3(1), D_3(2), D_3(3)] = [0.33, 0.5, 0.17]
  • Train stump [work omitted; a different stump because of the new data weights D; breaking ties opportunistically (will discuss at the end)]: h_3(x) = +1 if x_1 < −0.5, −1 otherwise
  • ε_3 = Σ_i D_3(i) δ(h_3(x_i) ≠ y_i) = 0.33×0 + 0.5×0 + 0.17×1 = 0.17
  • α_3 = (1/2) ln((1 − ε_3)/ε_3) = 0.5 × ln(4.88) = 0.79
  • Output final classifier: H(x) = sign(0.35 × h_1(x) + 0.55 × h_2(x) + 0.79 × h_3(x)), with h_1(x) = +1 if x_1 > 0.5, h_2(x) = +1 if x_1 < 1.5, h_3(x) = +1 if x_1 < −0.5 (each −1 otherwise)
  • Stop!!! How did we know to stop? (This ensemble now classifies all three training points correctly.) A short script reproducing these three rounds follows below.
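A short script that reproduces the three rounds above; the stump thresholds are hard-coded to match the slides rather than searched for, and small rounding differences (e.g. 0.80 vs. the slide's 0.79) come from the slides rounding ε_3 to 0.17:

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0])                     # x_1 values of the three examples
y = np.array([+1, -1, +1])                         # their labels
stumps = [lambda x: np.where(x > 0.5, 1, -1),      # h_1
          lambda x: np.where(x < 1.5, 1, -1),      # h_2
          lambda x: np.where(x < -0.5, 1, -1)]     # h_3

D = np.ones(3) / 3                                 # D_1 = [0.33, 0.33, 0.33]
alphas = []
for t, h in enumerate(stumps, start=1):
    pred = h(x)
    eps = np.sum(D * (pred != y))                  # weighted error of h_t
    alpha = 0.5 * np.log((1 - eps) / eps)
    alphas.append(alpha)
    D = D * np.exp(-alpha * y * pred)              # reweight
    D /= D.sum()                                   # normalize
    print(f"t={t}: eps={eps:.2f}, alpha={alpha:.2f}, next D={np.round(D, 2)}")

H = np.sign(sum(a * h(x) for a, h in zip(alphas, stumps)))
print("final predictions:", H, "labels:", y)       # all three points are classified correctly
```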

  30. Strong, weak classifiers • If each classifier is (at least slightly) better than random: ε_t < 0.5 • Another bound on the training error (written out below): training error ≤ exp(−2 Σ_t (1/2 − ε_t)²) • What does this imply about the training error? – It will reach zero! – It will get there exponentially fast! • Is it hard to achieve better-than-random training error?
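The bound referred to on this slide is the standard one for AdaBoost with the α_t chosen earlier; writing γ_t = 1/2 − ε_t for the edge over random guessing:

```latex
\text{training error}(H) \;\le\; \prod_{t=1}^{T} Z_t
  \;=\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)}
  \;=\; \prod_{t=1}^{T} \sqrt{1-4\gamma_t^2}
  \;\le\; \exp\!\Big(-2\sum_{t=1}^{T}\gamma_t^2\Big).
```

So if every weak learner beats random by at least γ (i.e. ε_t ≤ 1/2 − γ), the training error is at most exp(−2Tγ²): it reaches zero exponentially fast.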

  31. Boosting results – digit recognition [Schapire, 1989] [figure: test error and training error curves] • Boosting: – seems to be robust to overfitting – test error can decrease even after the training error is zero!!!

  32. Boosting generalization error bound [Freund & Schapire, 1996] Constants: • T: number of boosting rounds – higher T ⇒ looser bound • d: measures the complexity of the classifiers – higher d ⇒ bigger hypothesis space ⇒ looser bound • m: number of training examples – more data ⇒ tighter bound

  33. Boosting generalization error bound [Freund & Schapire, 1996] Constants (the bound is written out below): • T: number of boosting rounds – higher T ⇒ looser bound; what does this imply? • d: VC dimension of the weak learner, measures the complexity of the classifier – higher d ⇒ bigger hypothesis space ⇒ looser bound • m: number of training examples – more data ⇒ tighter bound • Theory does not match practice: – robust to overfitting – test set error decreases even after the training error is zero – we need better analysis tools; we’ll come back to this later in the quarter
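For reference, and up to logarithmic factors, the bound these two slides discuss is commonly stated as follows (a paraphrase of the Freund & Schapire result, not copied from the slides):

```latex
\text{error}_{\text{true}}(H)
  \;\le\; \widehat{\text{error}}_{\text{train}}(H)
  \;+\; \tilde{O}\!\left(\sqrt{\frac{T\,d}{m}}\right),
```

which is why higher T or d loosens the bound while more data m tightens it.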
