BBM406 Fundamentals of Machine Learning – Lecture 20: AdaBoost



SLIDE 1

BBM406 Fundamentals of Machine Learning

Lecture 20: AdaBoost

Aykut Erdem // Hacettepe University // Fall 2019

Illustration adapted from Alex Rogozhnikov

SLIDE 2

Last time… Bias/Variance Tradeoff

2

Graphical illustration of bias and variance. (http://scott.fortmann-roe.com/docs/BiasVariance.html)

slide by David Sontag
SLIDE 3

Last time… Bagging

  • Leo Breiman (1994)
  • Take repeated bootstrap samples from training set D.
  • Bootstrap sampling: Given set D containing N training examples, create D’ by drawing N examples at random with replacement from D.
  • Bagging (sketched in code below):
  • Create k bootstrap samples D1 ... Dk.
  • Train distinct classifier on each Di.
  • Classify new instance by majority vote / average.

3

slide by David Sontag
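A minimal sketch of the bagging procedure above, assuming scikit-learn-style base classifiers and labels in {−1, +1}; the function names (bagging_fit, bagging_predict) and the decision-tree base learner are illustrative choices, not part of the slide.

    import numpy as np
    from sklearn.base import clone
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, k=25, base=DecisionTreeClassifier(max_depth=3)):
        """Train k classifiers, one per bootstrap sample D_i drawn from D."""
        n = len(X)
        models = []
        for _ in range(k):
            idx = np.random.randint(0, n, size=n)          # N draws with replacement
            models.append(clone(base).fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        """Classify new instances by majority vote (labels assumed to be -1/+1)."""
        votes = np.stack([m.predict(X) for m in models])   # one row of votes per model
        return np.sign(votes.sum(axis=0))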
SLIDE 4

Last time… Random Forests

4

slide by Nando de Freitas

[From the book of Hastie, Friedman and Tibshirani]

[Figure: bootstrap trees for t = 1, 2, 3]

SLIDE 5

Boosting

5

SLIDE 6

Boosting Ideas

  • Main idea: use weak learner to create strong learner.
  • Ensemble method: combine base classifiers returned by weak learner.
  • Finding simple relatively accurate base classifiers often not hard.
  • But, how should base classifiers be combined?

6

slide by Mehryar Mohri
SLIDE 7

Example: “How May I Help You?”

  • Goal: automatically categorize type of call requested by phone customer (Collect, CallingCard, PersonToPerson, etc.)
  • yes I’d like to place a collect call long distance please (Collect)
  • operator I need to make a call but I need to bill it to my office (ThirdNumber)
  • yes I’d like to place a call on my master card please (CallingCard)
  • I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill (BillingCredit)
  • Observation:
  • easy to find “rules of thumb” that are “often” correct
  • e.g.: “IF ‘card’ occurs in utterance THEN predict ‘CallingCard’ ”
  • hard to find single highly accurate prediction rule

7

[Gorin et al.]

slide by Rob Schapire
SLIDE 8

Boosting: Intuition

  • Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space
  • Output class: (weighted) vote of each classifier
  • Classifiers that are most “sure” will vote with more conviction
  • Classifiers will be most “sure” about a particular part of the space
  • On average, do better than single classifier!
  • But how do you???
  • force classifiers to learn about different parts of the input space?
  • weigh the votes of different classifiers?

8

slide by Aarti Singh & Barnabas Poczos
SLIDE 9

Boosting [Schapire, 1989]

  • Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote
  • On each iteration t:
  • weight each training example by how incorrectly it was classified
  • Learn a hypothesis – ht
  • A strength for this hypothesis – αt
  • Final classifier:
  • A linear combination of the votes of the different classifiers weighted by their strength
  • Practically useful
  • Theoretically interesting

9

slide by Aarti Singh & Barnabas Poczos
SLIDE 10

Boosting: Intuition

  • Want to pick weak classifiers that contribute something to the ensemble

10

Greedy algorithm: for m = 1, ..., M
  • Pick a weak classifier hm
  • Adjust weights: misclassified examples get “heavier”
  • αm set according to weighted error of hm

slide by Raquel Urtasun

[Source: G. Shakhnarovich]

SLIDES 11–16

Boosting: Intuition

(The same slide is repeated, with figures building up the ensemble on a toy example over successive iterations.)

slide by Raquel Urtasun

[Source: G. Shakhnarovich]

SLIDE 17

First Boosting Algorithms

  • [Schapire ’89]:
  • first provable boosting algorithm
  • [Freund ’90]:
  • “optimal” algorithm that “boosts by majority”
  • [Drucker, Schapire & Simard ’92]:
  • first experiments using boosting
  • limited by practical drawbacks
  • [Freund & Schapire ’95]:
  • introduced “AdaBoost” algorithm
  • strong practical advantages over previous boosting algorithms

17

slide by Rob Schapire
SLIDE 18

The AdaBoost Algorithm

18

SLIDE 19

Toy Example

weak hypotheses = vertical or horizontal half-planes

19

Minimize the weighted error εt. For binary ht, typically use αt = ½ log((1 − εt)/εt).

slide by Rob Schapire
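As a quick check on the numbers in the rounds that follow, the vote formula αt = ½ log((1 − εt)/εt) roughly reproduces the α values shown on the next slides; small differences come from rounding of the reported errors. A few lines of Python:

    import math

    def alpha(eps):
        """Vote for a weak classifier with weighted error eps."""
        return 0.5 * math.log((1 - eps) / eps)

    for eps in (0.30, 0.21, 0.14):
        print(f"eps = {eps:.2f}  ->  alpha = {alpha(eps):.2f}")
    # eps = 0.30  ->  alpha = 0.42
    # eps = 0.21  ->  alpha = 0.66   (slide reports 0.65)
    # eps = 0.14  ->  alpha = 0.91   (slide reports 0.92)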
SLIDE 20

Round 1

20

h1: ε1 = 0.30

slide by Rob Schapire
SLIDE 21

Round 1

21

h1: ε1 = 0.30, α1 = 0.42

slide by Rob Schapire
SLIDE 22

Round 1

22

h1: ε1 = 0.30, α1 = 0.42; updated weights D2

slide by Rob Schapire
SLIDE 23

Round 2

23

h2: ε2 = 0.21

slide by Rob Schapire
SLIDE 24

Round 2

24

h2: ε2 = 0.21, α2 = 0.65

slide by Rob Schapire
SLIDE 25

Round 2

25

h2: ε2 = 0.21, α2 = 0.65; updated weights D3

slide by Rob Schapire
SLIDE 26

Round 3

26

h3: ε3 = 0.14

slide by Rob Schapire
SLIDE 27

Round 3

27

h3: ε3 = 0.14, α3 = 0.92

slide by Rob Schapire
SLIDE 28

Hfinal = sign(0.42 h1 + 0.65 h2 + 0.92 h3)

Final Hypothesis

28

slide by Rob Schapire
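The final hypothesis is just the sign of the weighted vote of the three rounds. A small sketch, where h1, h2, h3 stand in for the three half-plane classifiers; the thresholds below are placeholders, not the ones from the figures.

    import numpy as np

    def H_final(x, h1, h2, h3):
        """Weighted vote of the three weak hypotheses from the toy example."""
        return np.sign(0.42 * h1(x) + 0.65 * h2(x) + 0.92 * h3(x))

    # Placeholder stumps, for illustration only:
    h1 = lambda x: 1 if x[0] < 0.3 else -1   # vertical half-plane
    h2 = lambda x: 1 if x[0] < 0.8 else -1   # vertical half-plane
    h3 = lambda x: 1 if x[1] > 0.5 else -1   # horizontal half-plane
    print(H_final((0.2, 0.9), h1, h2, h3))   # prints 1.0 (all three vote +1)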
SLIDE 29

Voted combination of classifiers

  • The general problem here is to try to combine many simple “weak” classifiers into a single “strong” classifier
  • We consider voted combinations of simple binary ±1 component classifiers, where the (non-negative) votes αi can be used to emphasize component classifiers that are more reliable than others

29

slide by Tommi S. Jaakkola
SLIDE 30

Components: Decision stumps

  • Consider the following simple family of component classifiers generating ±1 labels; these are called decision stumps (see the sketch below)
  • Each decision stump pays attention to only a single component of the input vector

30

slide by Tommi S. Jaakkola
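One plausible way to fit a single decision stump under example weights, consistent with the description above: for each input dimension, try candidate thresholds and keep the dimension/threshold/sign with the lowest weighted error. The names fit_stump and stump_predict are made up for illustration.

    import numpy as np

    def fit_stump(X, y, w):
        """Return (k, t, s) so the stump predicts s where x[k] > t, else -s.

        X: (n, d) inputs, y: (n,) labels in {-1, +1}, w: (n,) non-negative weights.
        """
        n, d = X.shape
        best, best_err = None, np.inf
        for k in range(d):                                   # one input component per stump
            vals = np.unique(X[:, k])
            cands = (vals[:-1] + vals[1:]) / 2 if len(vals) > 1 else vals
            for t in cands:
                for s in (+1, -1):
                    pred = s * np.where(X[:, k] > t, 1, -1)
                    err = w[pred != y].sum()                 # weighted misclassification
                    if err < best_err:
                        best, best_err = (k, t, s), err
        return best

    def stump_predict(X, k, t, s):
        return s * np.where(X[:, k] > t, 1, -1)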
SLIDE 31

Voted combinations (cont’d.)

  • We need to define a loss function for the combination so we can determine which new component h(x; θ) to add and how many votes it should receive
  • While there are many options for the loss function, we consider here only a simple exponential loss

31

slide by Tommi S. Jaakkola
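The exponential loss mentioned above is exp(−y·f(x)), where f(x) is the current voted combination; its empirical (training-set) version is just an average. A tiny sketch, with names of my own choosing:

    import numpy as np

    def empirical_exp_loss(F, y):
        """Average exponential loss exp(-y_i * F_i) over the training set.

        F: (n,) real-valued ensemble scores f(x_i); y: (n,) labels in {-1, +1}.
        """
        return np.mean(np.exp(-y * F))

Points the ensemble already classifies correctly with a large margin contribute almost nothing, while badly misclassified points dominate the average, which is what produces the weighting towards mistakes on the following slides.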
SLIDE 32

Modularity, errors, and loss

  • Consider adding the mth component:

32

slide by Tommi S. Jaakkola
SLIDE 33

Modularity, errors, and loss

  • Consider adding the mth component:

33

slide by Tommi S. Jaakkola
SLIDE 34

Modularity, errors, and loss

  • Consider adding the mth component:

  • So at the mth iteration the new component (and the votes) should optimize a weighted loss (weighted towards mistakes).

34

slide by Tommi S. Jaakkola
SLIDE 35

Empirical exponential loss (cont’d.)

  • To increase modularity we’d like to further decouple the optimization of h(x; θm) from the associated votes αm
  • To this end we select the h(x; θm) that optimizes the rate at which the loss would decrease as a function of αm

35

slide by Tommi S. Jaakkola
SLIDE 36

Empirical exponential loss (cont’d.)

  • We find the h(x; θm) that minimizes the weighted training loss
  • We can also normalize the weights so that they sum to one

36

slide by Tommi S. Jaakkola
SLIDE 37

Empirical exponential loss (cont’d.)

  • We find the h(x; θm) that minimizes the weighted training error, where the weights are those defined on the previous slide
  • αm is subsequently chosen to minimize the empirical exponential loss

37

slide by Tommi S. Jaakkola
SLIDE 38

The AdaBoost Algorithm

38

slide by Jiri Matas and Jan Šochman
SLIDES 39–44

The AdaBoost Algorithm

Given: (x1, y1), . . . , (xm, ym); xi ∈ X, yi ∈ {−1, +1}

Initialise weights D1(i) = 1/m

For t = 1, ..., T:
  • Find ht = arg min over hj ∈ H of εj, where εj = Σ_{i=1..m} Dt(i) · ⟦yi ≠ hj(xi)⟧
  • If εt ≥ 1/2 then stop
  • Set αt = ½ log((1 − εt)/εt)
  • Update Dt+1(i) = Dt(i) · exp(−αt yi ht(xi)) / Zt, where Zt is a normalisation factor

Output the final classifier: H(x) = sign( Σ_{t=1..T} αt ht(x) )

(Slides 39–43 reveal the algorithm one line at a time; slide 44 shows the complete algorithm together with a plot of training error against boosting step at t = 1.)

slide by Jiri Matas and Jan Šochman

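A compact sketch of the algorithm as written above, with the weak learner passed in as a pair of functions (for example the decision-stump sketch from earlier, wired in with a small adapter). This is one possible implementation, not the original authors' code; the names are illustrative.

    import numpy as np

    def adaboost_fit(X, y, T, weak_fit, weak_predict):
        """Run T rounds of AdaBoost on labels in {-1, +1}.

        weak_fit(X, y, D) -> weak-hypothesis parameters;
        weak_predict(X, params) -> predictions in {-1, +1}.
        """
        n = len(X)
        D = np.full(n, 1.0 / n)                       # D_1(i) = 1/m
        hyps, alphas = [], []
        for t in range(T):
            h = weak_fit(X, y, D)                     # weak hypothesis with small weighted error
            pred = weak_predict(X, h)
            eps = D[pred != y].sum()                  # weighted training error
            if eps >= 0.5:                            # no better than chance: stop
                break
            alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
            D = D * np.exp(-alpha * y * pred)         # misclassified examples get heavier
            D = D / D.sum()                           # divide by the normalisation factor Z_t
            hyps.append(h)
            alphas.append(alpha)
        return hyps, alphas

    def adaboost_predict(X, hyps, alphas, weak_predict):
        """Final classifier H(x) = sign(sum_t alpha_t * h_t(x))."""
        F = sum(a * weak_predict(X, h) for h, a in zip(hyps, alphas))
        return np.sign(F)

    # Example wiring with the stump sketch from the decision-stumps slide:
    #   hyps, alphas = adaboost_fit(X, y, T=40, weak_fit=fit_stump,
    #                               weak_predict=lambda X, p: stump_predict(X, *p))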
SLIDES 45–51

The AdaBoost Algorithm

(The same algorithm slide is repeated while the training-error plot is updated for t = 2, 3, 4, 5, 6, 7, and 40.)

slide by Jiri Matas and Jan Šochman

SLIDES 52–54

Reweighting

(Figures illustrating how the example weights Dt change from round to round.)

slide by Jiri Matas and Jan Šochman
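To make the reweighting step concrete, here is a small numeric illustration of one update with Dt+1(i) = Dt(i)·exp(−αt·yi·ht(xi))/Zt; the labels and predictions are invented for the example, not taken from the slides.

    import numpy as np

    y    = np.array([+1, +1, -1, -1, +1])     # true labels
    pred = np.array([+1, -1, -1, +1, +1])     # h_t's predictions: examples 2 and 4 are wrong
    D    = np.full(5, 0.2)                    # uniform weights before the update

    eps   = D[pred != y].sum()                # weighted error = 0.4
    alpha = 0.5 * np.log((1 - eps) / eps)     # about 0.20
    D_new = D * np.exp(-alpha * y * pred)
    D_new = D_new / D_new.sum()
    print(np.round(D_new, 3))                 # -> [0.167 0.25 0.167 0.25 0.167]

After the update the misclassified examples carry half of the total weight between them, so the next weak classifier is pushed to get them right.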
SLIDE 55

Boosting results – Digit recognition

  • Boosting is often (but not always) robust to overfitting
  • Test set error decreases even after training error is zero

55

[Schapire, 1989]

[Plot: training error and test error vs. number of rounds]

slide by Carlos Guestrin
SLIDE 56

Application: Detecting Faces

  • Training Data
  • 5000 faces
  • All frontal
  • 300 million non-faces
  • 9500 non-face images

56

[Viola & Jones]

slide by Rob Schapire
SLIDE 57

Application: Detecting Faces

  • Problem: find faces in photograph or movie
  • Weak classifiers: detect a light/dark rectangle in the image (see the sketch below)


  • Many clever tricks to make extremely fast and accurate

57

[Viola & Jones]

slide by Rob Schapire
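The "light/dark rectangle" weak classifiers are Haar-like features that can be evaluated very quickly with an integral image; below is a rough sketch of one two-rectangle feature. The particular coordinates, window size, and threshold are made-up illustrations, not Viola & Jones's actual configuration.

    import numpy as np

    def integral_image(img):
        """Cumulative sums so any rectangle sum costs only four lookups."""
        return img.cumsum(axis=0).cumsum(axis=1)

    def rect_sum(ii, r0, c0, r1, c1):
        """Sum of img[r0:r1, c0:c1] from the integral image ii."""
        s = ii[r1 - 1, c1 - 1]
        if r0 > 0:            s -= ii[r0 - 1, c1 - 1]
        if c0 > 0:            s -= ii[r1 - 1, c0 - 1]
        if r0 > 0 and c0 > 0: s += ii[r0 - 1, c0 - 1]
        return s

    def two_rect_feature(ii, r0, c0, h, w):
        """'Light minus dark': left rectangle sum minus right rectangle sum."""
        left  = rect_sum(ii, r0, c0, r0 + h, c0 + w)
        right = rect_sum(ii, r0, c0 + w, r0 + h, c0 + 2 * w)
        return left - right

    def weak_classify(window, theta=0.0):
        """A stump on one such feature for a 24x24 grey-level window."""
        ii = integral_image(window.astype(float))
        return 1 if two_rect_feature(ii, 4, 4, 8, 6) > theta else -1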
SLIDE 58

Boosting vs. Logistic Regression

58

Logistic regression:
  • Minimize log loss
  • Define f(x) as a weighted sum of predefined features xj (a linear classifier)
  • Jointly optimize over all weights w0, w1, w2, …

Boosting:
  • Minimize exp loss
  • Define f(x) as a weighted sum of weak classifiers ht(x), chosen dynamically to fit the data (not a linear classifier)
  • Weights αt learned incrementally, one per iteration t

slide by Aarti Singh
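The two losses being contrasted can be written as functions of the margin m = y·f(x); this side-by-side view is my framing of the slide's comparison, not text from the slide.

    import numpy as np

    def log_loss(margin):
        """Logistic regression: log(1 + exp(-y f(x)))."""
        return np.log1p(np.exp(-margin))

    def exp_loss(margin):
        """Boosting: exp(-y f(x))."""
        return np.exp(-margin)

    for m in (-2.0, 0.0, 2.0):
        print(f"margin {m:+.1f}:  log loss {log_loss(m):.3f}   exp loss {exp_loss(m):.3f}")
    # Both decrease with the margin, but the exponential loss punishes badly
    # misclassified (large negative margin) points much more aggressively.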
SLIDE 59

Boosting vs. Bagging

Bagging:
  • Resample data points
  • Weight of each classifier is the same
  • Only variance reduction

59

Boosting:
  • Reweights data points (modifies their distribution)
  • Weight is dependent on classifier’s accuracy
  • Both bias and variance reduced – learning rule becomes more complex with iterations

slide by Aarti Singh
SLIDE 60

Next Lecture:

K-Means Clustering

60