SLIDE 1

Stefan Roth, 11.05.2012 | Department of Computer Science | GRIS

Statistical Machine Learning

A Crash Course

Part III: Boosting

  • 11.05.2012
SLIDE 2

Combining Classifiers

■ Horse race prediction:

SLIDE 3

Combining Classifiers

■ How do we make money from horse racing bets?
■ Ask a professional.
■ It is very likely that...

  • The professional cannot give a single highly accurate rule.
  • But presented with a set of races, can always generate better-than-random rules.

■ Can you get rich?

■ Disclaimer: We are not saying you should actually try this at home :-)

SLIDE 4

Combining Classifiers

■ Idea:

  • Ask an expert for their rule-of-thumb.
  • Assemble the set of cases where the rule-of-thumb fails (hard cases).

  • Ask the expert again for the selected set of hard cases.
  • And so on…

■ Combine many rules-of-thumb.

SLIDE 5

Combining Classifiers

■ How do we actually do this?
■ How do we choose races on each round?

  • Concentrate on the “hardest” races (those most often misclassified by previous rules of thumb).

■ How do we combine rules of thumb into a single prediction rule?

  • Take a (weighted) majority vote of several rules-of-thumb.
  • We take a weighted average of simple rules (models):

h_t : \mathbb{R}^d \to \{+1, -1\}

H(x) = \operatorname{sign}\Bigl( \sum_{t=1}^{T} \alpha_t h_t(x) \Bigr)
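To make this concrete, here is a minimal Python sketch (my own illustration, not from the slides) of the weighted vote, assuming each rule-of-thumb h_t is given as a function that maps a feature vector to +1 or −1:

```python
def strong_classify(x, weak_rules, alphas):
    """Weighted majority vote: H(x) = sign(sum_t alpha_t * h_t(x))."""
    score = sum(a * h(x) for a, h in zip(alphas, weak_rules))
    return 1 if score >= 0 else -1

# Hypothetical rules-of-thumb on a 2D feature vector (e.g., two horse statistics).
rules = [lambda x: 1 if x[0] > 0 else -1,
         lambda x: 1 if x[1] > 0 else -1,
         lambda x: 1 if x[0] + x[1] > 1 else -1]

print(strong_classify([0.5, 2.0], rules, [0.9, 0.4, 0.3]))  # -> 1
```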

SLIDE 6

Boosting

■ General method of converting rough rules of thumb into a highly accurate prediction rule.
■ More formally:

  • Given a “weak” learning algorithm that can consistently find “weak classifiers” with a (training) error of ε_t ≤ 1/2 − γ.
  • A boosting algorithm can provably construct a “strong classifier” with an arbitrarily small training error.

■ As long as we have a “weak” learning algorithm that does better than chance, we can convert it into an algorithm that performs arbitrarily well!

SLIDE 7

AdaBoost: Toy Example

■ Training data:

SLIDE 8

AdaBoost: Toy Example

■ Round 1:

[Figure: 1st weak classifier; reweighted training data]

SLIDE 9

AdaBoost: Toy Example

■ Round 2:

[Figure: 1st weak classifier; 2nd weak classifier; reweighted training data]

SLIDE 10

AdaBoost: Toy Example

■ Round 3:

[Figure: 1st, 2nd, and 3rd weak classifiers]

SLIDE 11

AdaBoost: Toy Example

■ Weighted combination:

SLIDE 12

AdaBoost: Toy Example

■ Final hypothesis / “strong” classifier:

SLIDE 13

AdaBoost

■ Given: Training data with labels (x_1, y_1), \ldots, (x_N, y_N), where x_i \in \mathbb{R}^d, y_i \in \{+1, -1\}.
■ Initialize weights for every data point: D_1(i) = \frac{1}{N}.
■ Loop over t = 1, \ldots, T (the number of boosting rounds):

  • Train the weak learner h_t : \mathbb{R}^d \to \{+1, -1\} on the training data so that the weighted error with weights D_t is minimized.
  • Choose an appropriate weight \alpha_t \in \mathbb{R}^+ for the weak classifier.
  • Update the data weights as

    D_{t+1}(i) = \frac{1}{Z_t} D_t(i) \exp\{-\alpha_t y_i h_t(x_i)\},

    where Z_t is chosen such that D_{t+1} sums to 1.
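To make the loop concrete, here is a minimal NumPy sketch (my own illustration, not code from the lecture). It assumes a helper weak_learner(X, y, D) that returns a prediction function together with its weighted error, and it already uses the α_t weighting derived a few slides further on:

```python
import numpy as np

def adaboost_train(X, y, weak_learner, T):
    """Sketch of the AdaBoost loop: X is (N, d), y is (N,) with labels in {+1, -1}.

    weak_learner(X, y, D) -> (predict_fn, eps) is an assumed helper returning a
    classifier (predict_fn(X) -> vector of +/-1) and its weighted error eps.
    """
    N = X.shape[0]
    D = np.full(N, 1.0 / N)                      # D_1(i) = 1/N
    rules, alphas = [], []
    for t in range(T):
        h, eps = weak_learner(X, y, D)
        eps = np.clip(eps, 1e-10, 1.0)           # guard against log(0)
        if eps >= 0.5:                           # no better than chance: stop
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)  # alpha_t from the later slides
        D *= np.exp(-alpha * y * h(X))           # up-weight mistakes, down-weight correct
        D /= D.sum()                             # Z_t: renormalize so D_{t+1} sums to 1
        rules.append(h)
        alphas.append(alpha)
    return rules, alphas

def adaboost_predict(X, rules, alphas):
    """H(x) = sign(sum_t alpha_t h_t(x))."""
    scores = sum(a * h(X) for a, h in zip(alphas, rules))
    return np.where(scores >= 0, 1, -1)
```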

SLIDE 14

AdaBoost

■ Given: Training data with labels (x_1, y_1), \ldots, (x_N, y_N), where x_i \in \mathbb{R}^d, y_i \in \{+1, -1\}.
■ Return the weighted (“strong”, ensemble) classifier:

  H(x) = \operatorname{sign}\Bigl( \sum_{t=1}^{T} \alpha_t h_t(x) \Bigr)

■ Intuition:

  • Boosting uses weighted training data and adapts the weights every round.
  • The weights make the algorithm focus on the wrongly classified examples:

    \exp\{-\alpha_t y_i h_t(x_i)\} \;
    \begin{cases} > 1 & \text{if } y_i \neq h_t(x_i) \\ < 1 & \text{if } y_i = h_t(x_i) \end{cases}

SLIDE 15

AdaBoost: Weak Learners

■ Training the weak learner:

  • Given training data (x_1, y_1), \ldots, (x_N, y_N)
  • and weights D_t(i) for all data points.
  • Select the weak classifier with the smallest weighted error:

    h_t = \arg\min_{h \in \mathcal{H}} \epsilon_t
    \qquad \text{with} \qquad
    \epsilon_t = \sum_{i=1}^{N} D_t(i)\, [y_i \neq h(x_i)]

  • Prerequisite: Weighted training error \epsilon_t \leq \frac{1}{2} - \gamma_t, with \gamma_t > 0.

■ Examples for \mathcal{H}:

  • Weighted least-squares classifier
  • Decision stumps (hold on...)
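A small NumPy sketch (my own, not from the slides) of this selection step, assuming the hypothesis class H is given as a finite pool of candidate prediction functions:

```python
import numpy as np

def weighted_error(D, y, pred):
    """eps = sum_i D(i) * [y_i != h(x_i)], with D assumed to sum to 1."""
    return float(np.sum(D * (pred != y)))

def select_weak_classifier(X, y, D, hypotheses):
    """h_t = argmin over the candidate pool of the weighted training error."""
    best_h, best_eps = None, np.inf
    for h in hypotheses:                 # each h maps X -> vector of +/-1
        eps = weighted_error(D, y, h(X))
        if eps < best_eps:
            best_h, best_eps = h, eps
    return best_h, best_eps
```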

SLIDE 16

AdaBoost: Weak Learners

SLIDE 17

AdaBoost: Weak Learners

SLIDE 18

AdaBoost

■ How do we select α_t?
■ We want to minimize the empirical error:

  \epsilon_{\mathrm{tr}}(H) = \frac{1}{N} \sum_{i=1}^{N} [y_i \neq H(x_i)]

■ The empirical error can be upper bounded [Freund & Schapire]:

  \epsilon_{\mathrm{tr}}(H) \leq \prod_{t=1}^{T} Z_t
  \qquad \text{with} \qquad
  Z_t = \sum_{i=1}^{N} D_t(i) \exp\{-\alpha_t y_i h_t(x_i)\}

■ To minimize the empirical error, we can greedily minimize Z_t in each round.

SLIDE 19

AdaBoost

■ Select α_t by greedily minimizing Z_t(α) in each round.

  • Minimizes an upper bound on the empirical error.

■ Minimize

  Z_t(\alpha) = \sum_{i=1}^{N} D_t(i) \exp\{-\alpha y_i h_t(x_i)\}

■ We obtain the AdaBoost weighting:

  \alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}
  \qquad \text{with} \qquad
  \epsilon_t = \sum_{i=1}^{N} D_t(i)\, [y_i \neq h_t(x_i)]
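For completeness (the slide only states the result), the minimization can be done in closed form by splitting Z_t(α) over correctly and incorrectly classified points, using that y_i h_t(x_i) is either +1 or −1:

```latex
Z_t(\alpha) = \sum_{i:\, y_i = h_t(x_i)} D_t(i)\, e^{-\alpha}
            + \sum_{i:\, y_i \neq h_t(x_i)} D_t(i)\, e^{\alpha}
            = (1 - \epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha}

\frac{\partial Z_t}{\partial \alpha} = -(1 - \epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha} = 0
\;\Longrightarrow\; e^{2\alpha} = \frac{1 - \epsilon_t}{\epsilon_t}
\;\Longrightarrow\; \alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}
```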

SLIDE 20

AdaBoost: Reweighting

D_{t+1}(i) = \frac{1}{Z_t} D_t(i) \exp\{-\alpha_t y_i h_t(x_i)\}

SLIDE 21

AdaBoost: Reweighting

Increase the weight on incorrectly classified examples; decrease the weight on correctly classified examples:

D_{t+1}(i) = \frac{1}{Z_t} D_t(i) \exp\{-\alpha_t y_i h_t(x_i)\}

\exp\{-\alpha_t y_i h_t(x_i)\} \;
\begin{cases} > 1 & \text{if } y_i \neq h_t(x_i) \\ < 1 & \text{if } y_i = h_t(x_i) \end{cases}

SLIDE 22

AdaBoost: Reweighting

■ Eventually, the algorithm focuses only on the very difficult cases:

SLIDE 23

AdaBoost: More realistic example

■ Initialize...

t = 0

SLIDE 24

AdaBoost: More realistic example

■ Initialize...
■ For t = 1, \ldots, T:

  • Find h_t = \arg\min_{h \in \mathcal{H}} \epsilon_t
  • Stop if \epsilon_t > \frac{1}{2}
  • Set \alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}
  • Reweight the data: D_{t+1}(i) = \frac{1}{Z_t} D_t(i) \exp\{-\alpha_t y_i h_t(x_i)\}

[Figure: round t = 1]

SLIDE 25

AdaBoost: More realistic example

■ Same loop as on the previous slide; the figure now shows round t = 2.

SLIDE 26

AdaBoost: More realistic example

■ Same loop as on the previous slide; the figure now shows round t = 3.

SLIDE 27

AdaBoost: More realistic example

■ Same loop as on the previous slide; the figure now shows round t = 4.

SLIDE 28

AdaBoost: More realistic example

■ Same loop as on the previous slide; the figure now shows round t = 5.

SLIDE 29

AdaBoost: More realistic example

■ Same loop as on the previous slide; the figure now shows round t = 6.

SLIDE 30

AdaBoost: More realistic example

■ Same loop as on the previous slide; the figure now shows round t = 7.

SLIDE 31

AdaBoost: More realistic example

■ Same loop as on the previous slide; the figure now shows round t = 40.

SLIDE 32

AdaBoost: Convergence

■ It can be shown that the training error is upper-bounded as:

  \epsilon_{\mathrm{tr}}(H) \leq \prod_{t=1}^{T} Z_t = \exp\Bigl( -2 \sum_{t=1}^{T} \gamma_t^2 \Bigr)

  • where γ_t = 1/2 − ε_t denotes how much better the weak learner is compared to random guessing.

■ If γ_t > γ, it holds that

  \epsilon_{\mathrm{tr}} \leq \exp\{-2 \gamma^2 T\}

  • This means: If the weak learner is always better than chance, we can make the boosted classifier perform arbitrarily well (on the training data).
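A quick numeric illustration (my own example, not from the slides): with a hypothetical constant edge of γ = 0.1, the bound guarantees less than 1% training error after roughly 231 rounds.

```python
import numpy as np

def training_error_bound(gamma, T):
    """Upper bound exp(-2 * gamma^2 * T) on the training error after T rounds."""
    return np.exp(-2.0 * gamma ** 2 * T)

for T in (10, 100, 231, 500):
    print(T, training_error_bound(0.1, T))
# 10 -> 0.82, 100 -> 0.14, 231 -> 0.0098, 500 -> 4.5e-05
```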

SLIDE 33

AdaBoost: Generalization

■ We might expect:

  • That the training error will go down as we add more weak classifiers.
  • That the test error will drop for a while, but then go up again as the model starts to overfit the training data.

SLIDE 34

AdaBoost: Generalization

■ Instead, what we typically see:

  • The test error goes down even though the training error is already 0.

  • AdaBoost doesn’t seem to overfit!
  • Why?

SLIDE 35

AdaBoost: Generalization

■ AdaBoost has a built-in notion of margin maximization that in turn leads to good generalization.
■ SVM margin:

  \min_{(x_i, y_i)} \frac{y_i (w^T x_i + b)}{\|w\|_2}

■ AdaBoost margin:

  \min_{(x_i, y_i)} \frac{y_i\, \alpha^T h(x_i)}{\|\alpha\|_1}
  = \min_{(x_i, y_i)} \frac{y_i f(x_i)}{\|\alpha\|_1}
  \qquad \text{with} \qquad
  f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)

■ Caveat:

  • Margin maximization is not always effective.
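A small sketch (my own illustration) of computing the normalized AdaBoost margins of a training set, reusing the rules/alphas representation assumed in the earlier snippets:

```python
import numpy as np

def adaboost_margins(X, y, rules, alphas):
    """Normalized margins y_i * f(x_i) / ||alpha||_1, each lying in [-1, +1]."""
    alphas = np.asarray(alphas, dtype=float)
    f = sum(a * h(X) for a, h in zip(alphas, rules))   # f(x_i) = sum_t alpha_t h_t(x_i)
    return y * f / np.abs(alphas).sum()

# margins.min() is the "minimum margin" that the table on the next slide tracks.
```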

SLIDE 36

AdaBoost: Generalization

■ Margin of a data point:  y_i f(x_i) / ||α||_1
■ Distribution of margins:

  • AdaBoost tends to increase the margin, because it focuses on the difficult cases.

  Rounds             5      100    1000
  Training error     0.0    0.0    0.0
  Test error         8.4    3.3    3.1
  % margins ≤ 0.5    7.7    0.0    0.0
  Minimum margin     0.14   0.52   0.55

SLIDE 37

Decision Stumps

■ Often the simplest, but still useful, weak learner.
■ Axis-aligned linear classifier:

  • Try all possible thresholds along all feature dimensions (see the sketch below).
  • We only have to try thresholds that lead to different classifications.
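A minimal NumPy sketch (my own illustration, not the lecture's code) of such a decision stump, trained by exhaustive search over feature dimensions and thresholds to minimize the weighted error ε_t:

```python
import numpy as np

def train_stump(X, y, D):
    """Axis-aligned decision stump: h(x) = s * sign(x[j] - theta), s in {+1, -1}.

    Tries every feature dimension j and every threshold between consecutive
    distinct feature values, keeping the stump with the smallest weighted error.
    """
    N, d = X.shape
    best = (np.inf, None)                          # (eps, (j, theta, s))
    for j in range(d):
        values = np.unique(X[:, j])
        midpoints = (values[:-1] + values[1:]) / 2.0
        for theta in np.concatenate(([values[0] - 1.0], midpoints)):
            pred = np.where(X[:, j] > theta, 1, -1)
            for s in (+1, -1):
                eps = np.sum(D * (s * pred != y))
                if eps < best[0]:
                    best = (eps, (j, theta, s))
    eps, (j, theta, s) = best
    predict = lambda Xnew: s * np.where(Xnew[:, j] > theta, 1, -1)
    return predict, eps
```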

SLIDE 38

AdaBoost: Advantages

■ Quite simple & easy to program.

  • We don’t need a complex classifier, but just a simple weak learner.

■ Only a single parameter to tune: the number of boosting rounds T.

  • Though AdaBoost is not very sensitive to it...

■ Very flexible:

  • Can be combined with almost any classifier (even neural networks).
  • Also works well with discrete features (not covered here).
  • Very useful as a feature selection method.

■ Solid theoretical foundation:

  • E.g., provably effective (assuming a weak learner that does better than chance).

SLIDE 39

AdaBoost: Caveats

■ The actual performance depends on the suitability of the data & the weak learner.
■ AdaBoost can fail if:

  • The weak hypothesis is too complex (overfitting)
  • The weak hypothesis is too weak (γ_t → 0 too quickly)
  • Underfitting
  • Small margins → overfitting

■ Empirically, AdaBoost seems especially susceptible to noise:

  • There are variants that are more robust, however.

SLIDE 40

AdaBoost: History

■ 1990 – Boost-by-majority algorithm (Freund)
■ 1995 – AdaBoost (Freund & Schapire)
■ 1997 – Generalized version of AdaBoost (Schapire & Singer)
■ 1998 – Theoretical analyses (Friedman, Hastie & Tibshirani and many others)
■ 2001 – AdaBoost in face detection (Viola & Jones)

  • Took off in various application areas.

■ Freund & Schapire won the 2003 Gödel Prize for AdaBoost.

SLIDE 41

Extensions

■ Posterior probability of the class assignment (of the “strong” classifier):

  p(y = +1 \mid x) = \frac{\exp(f(x))}{\exp(f(x)) + \exp(-f(x))}
  \qquad \text{with} \qquad
  f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)

■ Multi-class extensions are also possible:

  • AdaBoost.M1
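A one-line sketch (my own) of this posterior; note that it is algebraically the logistic sigmoid applied to 2·f(x):

```python
import numpy as np

def posterior_positive(f_x):
    """p(y = +1 | x) = exp(f(x)) / (exp(f(x)) + exp(-f(x))) = sigmoid(2 * f(x))."""
    return 1.0 / (1.0 + np.exp(-2.0 * f_x))

print(posterior_positive(0.0))   # 0.5: undecided
print(posterior_positive(2.3))   # ~0.99: confidently positive
```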

SLIDE 42

Try it yourself

■ Boosting applet by Yoav Freund:

  • http://www.cse.ucsd.edu/~yfreund/adaboost/index.html

SLIDE 43

Application: Face Detection

■ Training data

  • 5000 faces, all frontal
  • 10^8 non-faces
  • normalized for scale and translation

SLIDE 44

Scan window over image pyramid

Image source: A. Zisserman
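The slide only names the idea; below is a rough Python sketch (my own, with assumed parameters) of sliding a fixed-size window over a crude factor-2 image pyramid. classify_window is an assumed hook, e.g. the boosted classifier from the previous slides:

```python
import numpy as np

def detect(image, classify_window, win=24, stride=4):
    """Slide a win x win window over a simple factor-2 image pyramid.

    classify_window(patch) -> score is an assumed hook; detections are returned
    in original-image coordinates as (x, y, size, score).
    """
    detections, scale = [], 1
    img = np.asarray(image, dtype=float)
    while min(img.shape[:2]) >= win:
        H, W = img.shape[:2]
        for y in range(0, H - win + 1, stride):
            for x in range(0, W - win + 1, stride):
                score = classify_window(img[y:y + win, x:x + win])
                if score > 0:
                    detections.append((x * scale, y * scale, win * scale, score))
        img = img[::2, ::2]      # crude downscaling; a real pyramid would smooth first
        scale *= 2
    return detections
```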

SLIDE 45

Features: Simple Filters

SLIDE 46

Face Detection with Boosting

The first couple of features selected are quite intuitive.

SLIDE 47

SLIDE 48

Application: Pedestrian Detection

Viola, Jones and Snow, ICCV’03

SLIDE 49

Training Data

Some positive training examples.

Viola, Jones and Snow, ICCV’03

SLIDE 50

Features: Simple Features

Viola, Jones and Snow, ICCV’03

Examples of simple linear filters. Many different possible filters of this type. 24x24 windows applied at multiple scales. 45,396 possible features in each window.
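The transcript only shows the filters as images. As an aside, Viola & Jones evaluate such rectangle filters in constant time via an integral image; the sketch below (my own, with an assumed left-minus-right sign convention) computes a two-rectangle feature that way:

```python
import numpy as np

def integral_image(img):
    """Cumulative sums so that any rectangle sum costs four lookups."""
    ii = np.cumsum(np.cumsum(np.asarray(img, dtype=float), axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))   # zero-pad so rect_sum needs no special cases

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the w x h rectangle with top-left corner (x, y)."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def two_rect_feature(ii, x, y, w, h):
    """Left-minus-right two-rectangle filter, one of the simple features above."""
    return rect_sum(ii, x, y, w, h) - rect_sum(ii, x + w, y, w, h)
```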

SLIDE 51

Pedestrian Filters

Viola, Jones and Snow, ICCV’03

SLIDE 52

Viola, Jones and Snow, ICCV’03
