slide-1
SLIDE 1

HW1

  • Grades are out
  • Total: 180
  • Min: 55
  • Max: 188 (178 + 10 for bonus credit)
  • Average: 174.24
  • Median: 178
  • Std: 18.225


slide-2
SLIDE 2

Top5 on HW1


  • 1. Curtis, Josh (score: 188, test accuracy: 0.9598)
  • 2. Huang, Waylon (score: 180, test accuracy: 0.8202)
  • 3. Luckey, Royden (score: 180, test accuracy: 0.8192)
  • 4. Luo, Mathew Han (score: 180, test accuracy: 0.8174)
  • 5. Shen, Dawei (score: 180, test accuracy: 0.8130)
slide-3
SLIDE 3

CSE446: Ensemble Learning - Bagging and Boosting Spring 2017

Ali Farhadi

Slides adapted from Carlos Guestrin, Nick Kushmerick, Padraig Cunningham, and Luke Zettlemoyer

slide-4
SLIDE 4


slide-5
SLIDE 5


slide-6
SLIDE 6

Voting (Ensemble Methods)

  • Instead of learning a single classifier, learn many weak classifiers that are good at different parts of the data

  • Output class: (Weighted) vote of each classifier

– Classifiers that are most “sure” will vote with more conviction
– Classifiers will be most “sure” about a particular part of the space
– On average, do better than single classifier!

  • But how???

– force classifiers to learn about different parts of the input space? different subsets of the data?
– weigh the votes of different classifiers?
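A minimal sketch of the (weighted) voting rule described above, assuming binary classifiers that output ±1; the classifiers and weights here are illustrative toys, not from the slides:

```python
import numpy as np

def weighted_vote(classifiers, weights, x):
    """Ensemble output: sign of the weighted sum of member votes (labels in {-1, +1})."""
    votes = np.array([h(x) for h in classifiers])   # each h returns +1 or -1
    return int(np.sign(np.dot(weights, votes)))

# Three toy "classifiers" that disagree; the more "sure" ones carry larger weights.
hs = [lambda x: +1, lambda x: -1, lambda x: +1]
print(weighted_vote(hs, [0.5, 0.3, 0.2], x=None))   # -> 1
```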

slide-7
SLIDE 7

BAGGing = Bootstrap AGGregation (Breiman, 1996)

  • for i = 1, 2, …, K:

– Ti ← randomly select M training instances with replacement
– hi ← learn(Ti) [Decision Tree, Naive Bayes, …]

  • Now combine the hi together with uniform voting (wi = 1/K for all i)
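A runnable sketch of this procedure, assuming numpy arrays, ±1 labels, and scikit-learn decision trees as the base learner (the function names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, K=25, M=None, seed=0):
    """BAGGing: learn K trees, each on M instances sampled with replacement."""
    rng = np.random.default_rng(seed)
    M = len(X) if M is None else M
    models = []
    for _ in range(K):
        idx = rng.integers(0, len(X), size=M)              # bootstrap sample T_i
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Uniform vote (w_i = 1/K); assumes labels in {-1, +1}."""
    avg_vote = np.mean([h.predict(X) for h in models], axis=0)
    return np.sign(avg_vote)
```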

slide-8
SLIDE 8


slide-9
SLIDE 9


decision tree learning algorithm; very similar to version in earlier slides

slide-10
SLIDE 10

shades of blue/red indicate strength of vote for particular classification

slide-11
SLIDE 11
slide-12
SLIDE 12

Fighting the bias-variance tradeoff

  • Simple (a.k.a. weak) learners are good

– e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees)
– Low variance, don’t usually overfit

  • Simple (a.k.a. weak) learners are bad

– High bias, can’t solve hard learning problems

  • Can we make weak learners always good???

– No!!!
– But often yes…

slide-13
SLIDE 13

Boosting

  • Idea: given a weak learner, run it multiple times on (reweighted) training data, then let learned classifiers vote

  • On each iteration t:

– weight each training example by how incorrectly it was classified
– Learn a hypothesis – ht
– A strength for this hypothesis – αt

  • Final classifier: H(x) = sign(Σt αt ht(x))
  • Practically useful
  • Theoretically interesting

[Schapire, 1989]

slide-14
SLIDE 14

time = 0

blue/red = class; size of dot = weight; weak learner = decision stump (a horizontal or vertical line)

slide-15
SLIDE 15


time = 1

this hypothesis has 15% error and so does this ensemble, since the ensemble contains just this one hypothesis

slide-16
SLIDE 16


time = 2

slide-17
SLIDE 17


time = 3

slide-18
SLIDE 18


time = 13

slide-19
SLIDE 19


time = 100

slide-20
SLIDE 20


time = 300

  • Overfitting!
slide-21
SLIDE 21

Learning from weighted data

  • Consider a weighted dataset

– D(i) – weight of the i-th training example (xi, yi)
– Interpretations:

  • i-th training example counts as if it occurred D(i) times
  • If I were to “resample” data, I would get more samples of “heavier” data points

  • Now, always do weighted calculations:

– e.g., MLE for Naïve Bayes: redefine Count(Y=y) to be the weighted count (a sketch follows below)
– setting D(j)=1 (or any constant value!) for all j recreates the unweighted case
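A small sketch of that weighted count, using a made-up three-example dataset purely for illustration:

```python
import numpy as np

def weighted_count(y, D, label):
    """Weighted Count(Y = label): sum the example weights D(i) instead of counting 1s."""
    return float(np.sum(D * (y == label)))

y = np.array([+1, -1, +1])
D = np.array([0.5, 0.25, 0.25])
print(weighted_count(y, D, +1))                 # 0.75 (weighted)
print(weighted_count(y, np.ones_like(D), +1))   # 2.0  -- D(j)=1 recovers the plain count
```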

slide-22
SLIDE 22

How? Many possibilities. Will see one shortly!

Why? Reweight the data: examples i that are misclassified will have higher weights!

  • yi ht(xi) > 0 ⇒ ht correct
  • yi ht(xi) < 0 ⇒ ht wrong
  • ht correct, αt > 0 ⇒ Dt+1(i) < Dt(i)
  • ht wrong, αt > 0 ⇒ Dt+1(i) > Dt(i)

Final Result: linear sum of “base” or “weak” classifier outputs.

Given: (x1,y1), …, (xm,ym) with yi ∈ {−1,+1}
Initialize: D1(i) = 1/m
For t=1…T:

  • Train base classifier ht(x) using Dt
  • Choose αt
  • Update, for i=1..m: Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt

with normalization constant Zt = Σi Dt(i) exp(−αt yi ht(xi))
Output final classifier: H(x) = sign(Σt αt ht(x))
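A runnable sketch of this loop, assuming ±1 labels, numpy arrays, and scikit-learn depth-1 trees as the base classifiers (the function names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """AdaBoost sketch: y must be in {-1, +1}; decision stumps as base classifiers."""
    m = len(X)
    D = np.full(m, 1.0 / m)                      # Initialize: D_1(i) = 1/m
    hs, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1)  # a decision stump
        h.fit(X, y, sample_weight=D)             # train base classifier using D_t
        pred = h.predict(X)
        eps = float(np.sum(D * (pred != y)))     # weighted training error eps_t
        if eps <= 0 or eps >= 0.5:               # perfect stump, or no edge over random: stop
            if eps <= 0:
                hs.append(h); alphas.append(1.0)
            break
        alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t = 1/2 ln((1 - eps_t) / eps_t)
        D = D * np.exp(-alpha * y * pred)        # D_{t+1}(i) ~ D_t(i) exp(-alpha_t y_i h_t(x_i))
        D = D / D.sum()                          # normalize by Z_t
        hs.append(h); alphas.append(alpha)
    return hs, alphas

def adaboost_predict(hs, alphas, X):
    """Final classifier: H(x) = sign(sum_t alpha_t h_t(x))."""
    scores = sum(a * h.predict(X) for a, h in zip(alphas, hs))
    return np.sign(scores)
```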

slide-23
SLIDE 23

Given: Initialize: For t=1…T:

  • Train base classifier ht(x) using Dt
  • Choose αt
  • Update, for i=1..m:
  • εt : error of ht, weighted by Dt
  • 0 ≤ εt ≤ 1
  • αt :
  • No errors: εt = 0 ⇒ αt = ∞
  • All errors: εt = 1 ⇒ αt = −∞
  • Random: εt = 0.5 ⇒ αt = 0

[Plot: αt as a function of εt]
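The choice of αt behind these limiting cases (the standard AdaBoost choice; it matches the closed form on slide 26 and the worked example that follows):

```latex
\alpha_t \;=\; \tfrac{1}{2}\,\ln\!\frac{1-\epsilon_t}{\epsilon_t},
\qquad
\epsilon_t \;=\; \sum_{i=1}^{m} D_t(i)\,\delta\!\left(h_t(x_i)\neq y_i\right).
```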

slide-24
SLIDE 24

What αt to choose for hypothesis ht?

Idea: choose αt to minimize a bound on training error!

Where … [Schapire, 1989]

slide-25
SLIDE 25

What αt to choose for hypothesis ht?

Idea: choose αt to minimize a bound on training error!

Where … And … (reconstructed below)

If we minimize ∏t Zt, we minimize our training error!!!

  • We can tighten this bound greedily, by choosing αt and ht on each iteration to minimize Zt.
  • ht is estimated as a black box, but can we solve for αt?

This equality isn’t obvious! Can be shown with algebra (telescoping sums)! [Schapire, 1989]
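A reconstruction of the missing “Where … And …” equations above, in the form the standard AdaBoost analysis uses (f denotes the weighted sum of weak hypotheses):

```latex
\frac{1}{m}\sum_{i=1}^{m}\delta\!\left(H(x_i)\neq y_i\right)
\;\le\;
\frac{1}{m}\sum_{i=1}^{m} e^{-y_i f(x_i)}
\;=\;
\prod_{t=1}^{T} Z_t,
\qquad
f(x)=\sum_{t=1}^{T}\alpha_t h_t(x),
\quad
Z_t=\sum_{i=1}^{m} D_t(i)\,e^{-\alpha_t y_i h_t(x_i)}.
```

The equality is the telescoping-sums step: unrolling the weight update gives D_{T+1}(i) = e^{-y_i f(x_i)} / (m ∏t Zt), and the weights D_{T+1} sum to one.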

slide-26
SLIDE 26

Summary: choose αt to minimize the error bound

We can squeeze this bound by choosing αt on each iteration to minimize Zt. For boolean Y: differentiate, set equal to 0, and there is a closed-form solution! [Freund & Schapire ’97]:

[Schapire, 1989]
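Spelling out that derivation for boolean Y (yi, ht(xi) ∈ {−1,+1}), as a reconstruction of the standard result:

```latex
Z_t(\alpha_t) \;=\; (1-\epsilon_t)\,e^{-\alpha_t} + \epsilon_t\,e^{\alpha_t},
\qquad
\frac{dZ_t}{d\alpha_t}=0
\;\Rightarrow\;
\alpha_t=\tfrac{1}{2}\ln\!\frac{1-\epsilon_t}{\epsilon_t},
\qquad
\min_{\alpha_t} Z_t \;=\; 2\sqrt{\epsilon_t(1-\epsilon_t)}.
```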

slide-27
SLIDE 27

Given: (x1,y1), …, (xm,ym) with yi ∈ {−1,+1}
Initialize: D1(i) = 1/m
For t=1…T:

  • Train base classifier ht(x) using Dt
  • Choose αt
  • Update, for i=1..m: Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt

with normalization constant Zt = Σi Dt(i) exp(−αt yi ht(xi))
Output final classifier: H(x) = sign(Σt αt ht(x))

slide-28
SLIDE 28

x1 | y
-1 | +1
 0 | -1
+1 | +1

Use decision stumps as the base classifier. Initial:

  • D1 = [D1(1), D1(2), D1(3)] = [.33,.33,.33]

t=1:

  • Train stump [work omitted, breaking ties randomly]
  • h1(x)=+1 if x1>0.5, -1 otherwise
  • ε1=ΣiD1(i) δ(h1(xi)≠yi)

= 0.33×1+0.33×0+0.33×0=0.33

  • α1=(1/2) ln((1-ε1)/ε1)=0.5×ln(2)= 0.35
  • D2(1) ∝ D1(1)×exp(-α1y1h1(x1))

= 0.33×exp(-0.35×1×-1) = 0.33×exp(0.35) = 0.46

  • D2(2) ∝ D1(2)×exp(-α1y2h1(x2))

= 0.33×exp(-0.35×-1×-1) = 0.33×exp(-0.35) = 0.23

  • D2(3) ∝ D1(3)×exp(-α1y3h1(x3))

= 0.33×exp(-0.35×1×1) = 0.33×exp(-0.35) = 0.23

  • D2 = [D2(1), D2(2), D2(3)] = [0.5,0.25,0.25]

t=2

  • Continues on next slide!

Initialize: For t=1…T:

  • Train base classifier ht(x) using Dt
  • Choose αt
  • Update, for i=1..m:

Output final classifier: H(x) = sign(0.35×h1(x))

  • h1(x)=+1 if x1>0.5, -1 otherwise
slide-29
SLIDE 29

x1 | y
-1 | +1
 0 | -1
+1 | +1

  • D2 = [D2(1), D2(2), D2(3)] = [0.5,0.25,0.25]

t=2:

  • Train stump [work omitted; different stump because of new data weights D; breaking ties opportunistically (will discuss at end)]

  • h2(x)=+1 if x1<1.5, -1 otherwise
  • ε2=ΣiD2(i) δ(h2(xi)≠yi)

= 0.5×0+0.25×1+0.25×0=0.25

  • α2=(1/2) ln((1-ε2)/ε2)=0.5×ln(3)= 0.55
  • D3(1) ∝ D2(1)×exp(-α2y1h2(x1))

= 0.5×exp(-0.55×1×1) = 0.5×exp(-0.55) = 0.29

  • D3(2) ∝ D2(2)×exp(-α2y2h2(x2))

= 0.25×exp(-0.55×-1×1) = 0.25×exp(0.55) = 0.43

  • D3(3) ∝ D2(3)×exp(-α2y3h2(x3))

= 0.25×exp(-0.55×1×1) = 0.25×exp(-0.55) = 0.14

  • D3 = [D3(1), D3(2), D3(3)] = [0.33,0.5,0.17]

t=3

  • Continues on next slide!

Initialize: For t=1…T:

  • Train base classifier ht(x) using Dt
  • Choose αt
  • Update, for i=1..m:

Output final classifier: H(x) = sign(0.35×h1(x)+0.55×h2(x))

  • h1(x)=+1 if x1>0.5, -1 otherwise
  • h2(x)=+1 if x1<1.5, -1 otherwise
slide-30
SLIDE 30

x1 | y
-1 | +1
 0 | -1
+1 | +1

  • D3 = [D3(1), D3(2), D3(3)] = [0.33,0.5,0.17]

t=3:

  • Train stump [work omitted; different stump because of new data weights D; breaking ties opportunistically (will discuss at end)]
  • h3(x)=+1 if x1<-0.5, -1 otherwise
  • ε3=ΣiD3(i) δ(h3(xi)≠yi)

= 0.33×0+0.5×0+0.17×1=0.17

  • α3=(1/2) ln((1-ε3)/ε3)=0.5×ln(4.88)= 0.79
  • Stop!!! How did we know to stop?

Initialize: For t=1…T:

  • Train base classifier ht(x) using Dt
  • Choose αt
  • Update, for i=1..m:

Output final classifier: H(x) = sign(0.35×h1(x)+0.55×h2(x)+0.79×h3(x))

  • h1(x)=+1 if x1>0.5, -1 otherwise
  • h2(x)=+1 if x1<1.5, -1 otherwise
  • h3(x)=+1 if x1<-0.5, -1 otherwise
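A short numeric check of the three rounds above, assuming the three-point dataset reconstructed from the arithmetic (x1 = -1, 0, 1 with labels +1, -1, +1):

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0])
y = np.array([+1, -1, +1])
stumps = [lambda x: np.where(x > 0.5, 1, -1),    # h1
          lambda x: np.where(x < 1.5, 1, -1),    # h2
          lambda x: np.where(x < -0.5, 1, -1)]   # h3

D = np.full(3, 1/3)                              # D1 = [.33, .33, .33]
alphas = []
for h in stumps:
    pred = h(x)
    eps = np.sum(D * (pred != y))                # weighted error eps_t
    alpha = 0.5 * np.log((1 - eps) / eps)        # alpha_t
    alphas.append(alpha)
    D = D * np.exp(-alpha * y * pred)            # reweight
    D = D / D.sum()                              # normalize
    print(round(eps, 2), round(alpha, 2), np.round(D, 2))
# eps = 0.33, 0.25, 0.17 and alpha = 0.35, 0.55, 0.80 -- matches the slides up to rounding

f = sum(a * h(x) for a, h in zip(alphas, stumps))
print(np.sign(f) == y)                           # [True True True]: zero training error, so stop
```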
slide-31
SLIDE 31

Strong, weak classifiers

  • If each classifier is (at least slightly) better than random: εt < 0.5
  • Another bound on error (reconstructed below):
  • What does this imply about the training error?

– Will reach zero!
– Will get there exponentially fast!

  • Is it hard to achieve better than random training error?
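The bound referred to above, as usually stated (writing γt = 1/2 − εt for the edge over random guessing):

```latex
\text{error}_{\text{train}}(H)
\;\le\;
\prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)}
\;=\;
\prod_{t=1}^{T}\sqrt{1-4\gamma_t^{2}}
\;\le\;
\exp\!\left(-2\sum_{t=1}^{T}\gamma_t^{2}\right).
```

So if every εt ≤ 1/2 − γ for some fixed γ > 0, the training error falls below e^(−2γ²T), i.e., exponentially fast in T.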
slide-32
SLIDE 32

Boosting results – Digit recognition

  • Boosting:

– Seems to be robust to overfitting
– Test error can decrease even after training error is zero!!!

[Schapire, 1989]

[Plot: test error and training error vs. number of boosting rounds]

slide-33
SLIDE 33

Boosting generalization error bound

Constants:

  • T: number of boosting rounds

– Higher T ⇒ looser bound

  • d: measures complexity of classifiers

– Higher d ⇒ bigger hypothesis space ⇒ looser bound

  • m: number of training examples

– More data ⇒ tighter bound

[Freund & Schapire, 1996]
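The bound being summarized here is usually written, up to log factors, as:

```latex
\text{error}_{\text{true}}(H)
\;\le\;
\text{error}_{\text{train}}(H) + \tilde{O}\!\left(\sqrt{\frac{T\,d}{m}}\right).
```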

slide-34
SLIDE 34

Boosting generalization error bound

Constants:

  • T: number of boosting rounds

– Higher T ⇒ looser bound; what does this imply?

  • d: VC dimension of weak learner, measures complexity of classifier

– Higher d ⇒ bigger hypothesis space ⇒ looser bound

  • m: number of training examples

– More data ⇒ tighter bound

[Freund & Schapire, 1996]

  • Theory does not match practice:
  • Robust to overfitting
  • Test set error decreases even after training error is zero
  • Need better analysis tools
  • We’ll come back to this later in the quarter
slide-35
SLIDE 35

Boosting: Experimental Results

Comparison of C4.5, Boosting C4.5, Boosting decision stumps (depth 1 trees), 27 benchmark datasets

[Freund & Schapire, 1996]

[Scatter plots comparing per-dataset errors of the methods]

slide-36
SLIDE 36

Boosting and Logistic Regression

Logistic regression is equivalent to minimizing the log loss; boosting minimizes a similar loss function (both reconstructed below):

Both smooth approximations of 0/1 loss!
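A reconstruction of the two loss functions being compared (f is the score whose sign gives the prediction in each model):

```latex
\text{Log loss (logistic regression):}\quad
\sum_{i=1}^{m}\ln\!\left(1+e^{-y_i f(x_i)}\right)
\qquad
\text{Exponential loss (boosting):}\quad
\sum_{i=1}^{m} e^{-y_i f(x_i)}.
```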

slide-37
SLIDE 37

Logistic regression and Boosting

Logistic regression:

  • Minimize loss fn
  • Define f(x) (reconstructed below), where each feature xj is predefined
  • Jointly optimize parameters w0, w1, …, wn via gradient ascent.

Boosting:

  • Minimize loss fn
  • Define f(x) (reconstructed below), where each ht(x) is learned to fit the data
  • Weights αt learned incrementally (a new one for each training pass)

slide-38
SLIDE 38

What you need to know about Boosting

  • Combine weak classifiers to get very strong classifier

– Weak classifier – slightly better than random on training data
– Resulting very strong classifier – can get zero training error

  • AdaBoost algorithm
  • Boosting v. Logistic Regression

– Both are linear models; boosting “learns” the features
– Similar loss functions
– Single optimization (LR) vs. incrementally improving classification (B)

  • Most popular application of Boosting:

– Boosted decision stumps!
– Very simple to implement, very effective classifier