SLIDE 1

Machine Learning for Signal Processing

Detecting faces (& other objects) in images

Class 7. 22 Sep 2015

SLIDE 2

Last Lecture: How to describe a face

  • A “typical face” that captures the essence of “facehood”
  • The principal eigenface

The typical face
SLIDE 3

A collection of least squares typical faces

  • Extension: many eigenfaces
  • Approximate every face f as f = wf,1 V1 + wf,2 V2 + … + wf,k Vk
      – V2 is used to “correct” errors resulting from using only V1
      – V3 corrects errors remaining after correction with V2
      – And so on
  • V = [V1 V2 V3 …] can be computed through eigen analysis
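A minimal sketch of this idea (not from the slides; the array shapes and the use of SVD for the eigen analysis are assumptions):

```python
import numpy as np

# Toy eigenface sketch: stack vectorized 100x100 faces as columns,
# take the top-k left singular vectors as V1..Vk, and approximate
# a face by its projection onto them (the least-squares solution).
X = np.random.rand(100 * 100, 500)           # 500 hypothetical training faces
mean_face = X.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(X - mean_face, full_matrices=False)
V = U[:, :3]                                 # V1, V2, V3: top-3 eigenfaces

f = X[:, [0]]                                # some face
w = V.T @ (f - mean_face)                    # weights wf,1 .. wf,3
f_hat = mean_face + V @ w                    # approximation of f
```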

SLIDE 4

Detecting Faces in Images

SLIDE 5

Detecting Faces in Images

  • Finding face-like patterns
      – How do we find out if a picture has faces in it?
      – Where are the faces?
  • A simple solution:
      – Define a “typical face”
      – Find the “typical face” in the image
SLIDE 6

Given an image and a ‘typical’ face how do I find the faces?

(Figure: a 100×100 “typical face” template and a 400×200 RGB picture)
SLIDE 7

Finding faces in an image

  • The picture is larger than the “typical face”
      – E.g. the typical face is 100×100, the picture is 600×800
  • First convert to greyscale
      – R + G + B
      – Not very useful to work in color
SLIDE 8

Finding faces in an image

  • Goal: to find out if and where images that look like the “typical” face occur in the picture

SLIDES 9–17

Finding faces in an image

  • Try to “match” the typical face to each location in the picture
SLIDE 18

Finding faces in an image

  • Try to “match” the typical face to each location in the picture
  • The “typical face” will explain some spots on the image much better than others
      – These are the spots at which we probably have a face!
SLIDES 19–20

How to “match”

  • What exactly is the “match”? What is the match “score”?
  • The DOT product
      – Express the typical face as a vector
      – Express the region of the image being evaluated as a vector
      – Compute the dot product of the typical-face vector and the “region” vector
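A minimal sketch of this scan (not from the slides; in practice one would at least normalize the scores, e.g. use correlation):

```python
import numpy as np

# Score every location of a greyscale picture P against a face
# template T via a sliding-window dot product; peaks in the score
# map indicate likely face locations.
def match_scores(P, T):
    th, tw = T.shape
    t = T.ravel()
    H, W = P.shape
    S = np.empty((H - th + 1, W - tw + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            S[i, j] = P[i:i + th, j:j + tw].ravel() @ t   # the match “score”
    return S
```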

SLIDE 21

What do we get

  • The right panel shows the dot product at various locations
      – Redder is higher
  • The locations of peaks indicate locations of faces!
SLIDES 22–23

What do we get

  • The right panel shows the dot product at various locations
      – Redder is higher
  • The locations of peaks indicate locations of faces!
  • Correctly detects all three faces
      – Likes George’s face most
          • He looks most like the typical face
  • Also finds a face where there is none!
      – A false alarm
SLIDE 24

Sliding windows solves only the issue of location – what about scale?

  • Not all faces are the same size
  • Some people have bigger faces
  • The size of the face in the image changes with perspective
  • Our “typical face” only represents one of these sizes
SLIDE 25

Scale-Space Pyramid
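A minimal sketch of a pyramid (not from the slides; the 1.25 factor matches the scaling step quoted later for the Viola-Jones detector, and the 100×100 minimum is assumed to match the template size):

```python
import cv2

# Scale-space pyramid: repeatedly shrink the picture so a fixed-size
# template can match faces of many sizes.
def pyramid(image, factor=1.25, min_size=(100, 100)):
    levels = [image]
    while True:
        h, w = levels[-1].shape[:2]
        nh, nw = int(h / factor), int(w / factor)
        if nh < min_size[0] or nw < min_size[1]:
            break
        levels.append(cv2.resize(levels[-1], (nw, nh)))
    return levels   # run the same template matcher on every level
```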

SLIDE 26

Speed concerns

  • Sliding windows AND the scale-space pyramid may yield millions of “windows” to investigate!
  • Especially for small objects in large images
SLIDE 27

Location – Scale – What about Rotation?

  • The head need not always be upright!
  • Our typical face image was upright
SLIDE 28

Solution

  • Create many “typical faces”
      – One for each scaling factor
      – One for each rotation
  • How will we do this?
  • Match them all
  • Does this work?
      – Kind of .. not well enough at all
      – We need more sophisticated models
SLIDES 29–30

Face Detection: A Quick Historical Perspective

  • Many more complex methods
      – Use edge detectors and search for face-like patterns
      – Find “feature” detectors (noses, ears, ..) and employ them in complex neural networks
  • The Viola-Jones method (20K+ citations!)
      – Boosted cascaded classifiers
SLIDE 31

And even before that – what is classification?

  • Given “features” describing an entity, determine the category it belongs to
      – Walks on two legs, has no hair. Is this
          • A chimpanzee?
          • A human?
      – Has long hair, is 5’6” tall. Is this
          • A man?
          • A woman?
      – Matches the “eye” pattern with score 0.5, the “mouth” pattern with score 0.25, the “nose” pattern with score 0.1. Are we looking at
          • A face?
          • Not a face?
SLIDE 32

Classification

  • Multi-class classification
      – Many possible categories
          • E.g. sounds: “AH, IY, UW, EY, ..”
          • E.g. images: “tree, dog, house, person, ..”
  • Binary classification
      – Only two categories
          • Man vs. woman
          • Face vs. not a face
SLIDES 33–34

Detection vs Classification

  • Detection: find an X
  • Classification: find the correct label X, Y, Z, etc.
  • Binary classification as detection: find the correct label, X or not-X
SLIDE 35

Face Detection as Classification

  • Faces can be many sizes
  • They can happen anywhere in the image
  • For each face size
      – For each location
          • Classify a rectangular region of that face size, at that location, as a face or not a face
  • This is a series of binary classification problems

For each square, run a classifier to find out if it is a face or not
SLIDE 36

Binary classification

  • Classification can be abstracted as follows: H: X → {+1, −1}
  • A function H that takes as input some X and outputs +1 or −1
      – X is the set of “features”
      – +1/−1 represent the two classes
  • Many mechanisms (many types of “H”)
      – And many ways of characterizing “X”
  • We’ll look at a specific method based on voting with simple rules
      – A “META” method
SLIDE 37

Introduction to Boosting

  • An ensemble method that sequentially combines many simple BINARY classifiers to construct a final complex classifier
      – The simple classifiers are often called “weak” learners
      – The complex classifier is called a “strong” learner
  • Each weak learner focuses on instances where the previous classifiers failed
      – Give greater weight to instances that have been incorrectly classified by previous learners
  • Restriction on weak learners
      – Better than 50% correct
  • The final classifier is a weighted sum of the weak classifiers
SLIDE 38

Boosting: A very simple idea

  • One can come up with many rules to classify
      – E.g. a chimpanzee vs. human classifier:
          – If arms == long, the entity is a chimpanzee
          – If height > 5’6”, the entity is a human
          – If it lives in a house, the entity is a human
          – If it lives in a zoo, the entity is a chimpanzee
  • Each of them is a reasonable rule, but makes many mistakes
      – Each rule has an intrinsic error rate
  • Combine the predictions of these rules
      – But not equally
      – Rules that are less accurate should be given lesser weight
SLIDE 39

Boosting and the Chimpanzee Problem

  • The total confidence in all classifiers that classify the entity as a chimpanzee is
        Score(chimpanzee) = Σ_{classifiers favoring chimpanzee} α_classifier
  • The total confidence in all classifiers that classify it as a human is
        Score(human) = Σ_{classifiers favoring human} α_classifier
  • If Score(chimpanzee) > Score(human), then our belief that we have a chimpanzee is greater than the belief that we have a human

(Figure: the four rules (arm length, height, lives in house, lives in zoo) each vote “chimp” or “human” with their own weights α)
SLIDE 40

Boosting

  • The basic idea: can a “weak” learning algorithm that performs just slightly better than random guessing be boosted into an arbitrarily accurate “strong” learner?
  • This is a “meta” algorithm that poses no constraints on the form of the weak learners themselves
SLIDE 41

Boosting: A Voting Perspective

  • Boosting is a form of voting
      – Let a number of different classifiers classify the data
      – Go with the majority
      – Intuition says that as the number of classifiers increases, the dependability of the majority vote increases
  • Boosting by majority
  • Boosting by weighted majority
      – A (weighted) majority vote taken over all the classifiers
      – How do we compute weights for the classifiers?
      – How do we actually train the classifiers?
SLIDE 42

AdaBoost

  • Challenge: how to optimize the classifiers and their weights?
      – Trivial solution: train all classifiers independently
      – Optimal: each classifier focuses on what the others missed
      – But joint optimization becomes impossible
  • Adaptive Boosting: greedy incremental optimization of classifiers
      – Keep adding classifiers incrementally, to fix what the others missed
SLIDES 43–53

AdaBoost: Illustrative Example

  • First weak learner
  • The first weak learner makes errors
  • Reweighted data
  • The second weak learner focuses on data “missed” by the first learner
  • The second strong learner combines both weak learners
  • Returning to the second weak learner: it makes errors too
  • Reweighting the data
  • The third weak learner focuses on data “missed” by the first and second learners
  • The third strong learner
SLIDE 54

Boosting: An Example

  • Red dots represent training data from the Red class
  • Blue dots represent training data from the Blue class
SLIDES 55–56

Boosting: An Example

  • The final strong learner has learnt a complicated decision boundary
  • Decision boundaries in areas with a low density of training points are assumed inconsequential
SLIDE 57

Overall Learning Pattern

  • The strong learner becomes increasingly accurate with an increasing number of weak learners
  • Residual errors become increasingly difficult to correct
      – Additional weak learners become less and less effective

(Figure: error of the nth weak learner and of the nth strong learner vs. number of weak learners)
SLIDE 58

Overfitting

  • Note: we can continue to add weak learners EVEN after the strong learner’s error goes to 0!
  • Shown to IMPROVE generalization!

(Figure: as before; the strong learner’s error may go to 0 while weak learners keep being added)
SLIDE 59

AdaBoost: Summary

  • No relation to Ada Lovelace
  • Adaptive Boosting
  • Adaptively selects weak learners
  • ~8K citations for just one paper by Freund and Schapire
SLIDE 60

The AdaBoost Algorithm

  • Initialize D1(xi) = 1/N
  • For t = 1, …, T
      – Train a weak classifier ht using distribution Dt
      – Compute the total error on the training data
          • et = Σi Dt(xi) · ½(1 − yi ht(xi))
      – Set αt = ½ ln((1 − et) / et)
      – For i = 1 … N
          • Set Dt+1(xi) = Dt(xi) exp(−αt yi ht(xi))
      – Normalize Dt+1 to make it a distribution
  • The final classifier is
      – H(x) = sign(Σt αt ht(x))
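A minimal sketch of the algorithm above, run on the toy eigenface data from the next slides (features E1, E2; +1 = face, −1 = non-face). The weak learners are decision stumps of the kind the slides use; the 0.05 threshold offset is an assumption that exploits the 0.1-spaced toy values:

```python
import numpy as np

X = np.array([[0.3, -0.6], [0.5, -0.5], [0.7, -0.1], [0.6, -0.4],
              [0.2,  0.4], [-0.8, -0.1], [0.4, -0.9], [0.2,  0.5]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])          # A..H

def best_stump(X, y, D):
    # Exhaustively try every (feature, threshold, sign) stump and
    # return the one with the smallest weighted error.
    best = None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]) - 0.05:        # just below each value
            for sign in (1, -1):
                pred = sign * np.where(X[:, f] > thr, 1, -1)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best

D = np.full(len(y), 1 / len(y))                      # D1(xi) = 1/N
H = []
for t in range(3):
    err, f, thr, sign = best_stump(X, y, D)
    alpha = 0.5 * np.log((1 - err) / err)            # αt
    pred = sign * np.where(X[:, f] > thr, 1, -1)
    D = D * np.exp(-alpha * y * pred)                # reweight
    D /= D.sum()                                     # renormalize
    H.append((alpha, f, thr, sign))

# Final classifier: H(x) = sign(Σt αt ht(x)); after 3 rounds all 8
# toy points are classified correctly.
score = sum(a * s * np.where(X[:, f] > th, 1, -1) for a, f, th, s in H)
print(np.sign(score))
```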

SLIDE 61

First, some example data

  • Face detection with multiple eigenfaces
  • Step 0: derive the top 2 eigenfaces E1, E2 from the eigenface training data
  • Step 1: on a (different) set of examples, express each image as a linear combination of the eigenfaces
      – Examples include both faces and non-faces
      – Even the non-face images are explained in terms of the eigenfaces

(Figure: each example image written as a combination of E1 and E2, e.g. 0.3 E1 − 0.6 E2; in general Image = a·E1 + b·E2 with a = Image·E1)
SLIDE 62

Training Data

ID    E1     E2     Class
A     0.3   −0.6    +1
B     0.5   −0.5    +1
C     0.7   −0.1    +1
D     0.6   −0.4    +1
E     0.2    0.4    −1
F    −0.8   −0.1    −1
G     0.4   −0.9    −1
H     0.2    0.5    −1

(Face = +1, Non-face = −1)
SLIDE 63

The AdaBoost Algorithm

(Algorithm repeated from slide 60.)
SLIDE 64

Initialize D1(xi) = 1/N

SLIDE 65

Training Data

ID    E1     E2     Class   Weight
A     0.3   −0.6    +1      1/8
B     0.5   −0.5    +1      1/8
C     0.7   −0.1    +1      1/8
D     0.6   −0.4    +1      1/8
E     0.2    0.4    −1      1/8
F    −0.8   −0.1    −1      1/8
G     0.4   −0.9    −1      1/8
H     0.2    0.5    −1      1/8
SLIDE 66
The AdaBoost Algorithm

(Algorithm repeated from slide 60; next step: train a weak classifier h1 using distribution D1.)
SLIDES 67–76

The E1 “Stump”

Classifier based on E1: if sign · wt(E1) > sign · threshold, then face = true (sign = +1 or −1).
Sweep the threshold across the sorted E1 values (F −0.8, E 0.2, H 0.2, A 0.3, G 0.4, B 0.5, D 0.6, C 0.7; all weights 1/8). For each candidate threshold and sign, the error is the total weight of the misclassified instances; over the sweep the sign = +1 error takes values 3/8, 2/8, and 1/8, while sign = −1 gives the complements 5/8, 6/8, and 7/8.
SLIDE 77

The Best E1 “Stump”

Classifier based on E1: if sign · wt(E1) > sign · threshold, then face = true.
Best setting: sign = +1, threshold = 0.45, error = 1/8 (only A, with E1 = 0.3, is misclassified).
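A quick check of the slide’s numbers (a sketch, with the candidate thresholds chosen by hand):

```python
import numpy as np

# Sweep thresholds on the E1 feature and print the weighted error of
# each sign = +1 stump; all weights are 1/8 in the first round.
e1 = np.array([0.3, 0.5, 0.7, 0.6, 0.2, -0.8, 0.4, 0.2])   # A..H
y  = np.array([1, 1, 1, 1, -1, -1, -1, -1])
D  = np.full(8, 1 / 8)

for thr in [-0.85, 0.15, 0.25, 0.35, 0.45, 0.55]:
    pred = np.where(e1 > thr, 1, -1)
    print(thr, D[pred != y].sum())           # thr = 0.45 gives 0.125 = 1/8
```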
SLIDE 78

The E2 “Stump”

Classifier based on E2: if sign · wt(E2) > sign · threshold, then face = true (sign = +1 or −1).
Sweep the threshold across the sorted E2 values (note the order: G −0.9, A −0.6, B −0.5, D −0.4, C −0.1, F −0.1, E 0.4, H 0.5; all weights 1/8). At the threshold shown: sign = +1, error = 3/8; sign = −1, error = 5/8.
SLIDE 79

The Best E2 “Stump”

Best setting: sign = −1, threshold = 0.15, error = 2/8 (classify as face when wt(E2) < 0.15; F and G are then misclassified).
SLIDE 80

The Best “Stump”

The best overall classifier based on a single feature uses E1:
If wt(E1) > 0.45 → Face (sign = +1, error = 1/8).
SLIDE 81

The Best “Stump”

SLIDE 82

The AdaBoost Algorithm

(Algorithm repeated from slide 60; next step: compute the total error e1 of the weak classifier.)
SLIDE 83

The Best “Stump”

SLIDE 84

The Best Error

NOTE: the error of the classifier is the sum of the weights of the misclassified instances.
For the best stump (wt(E1) > 0.45): sign = +1, error = 1/8.
SLIDE 85

The AdaBoost Algorithm

(Algorithm repeated from slide 60; next step: set αt = ½ ln((1 − et)/et).)
SLIDE 86

Computing Alpha

α1 = ½ ln((1 − 1/8) / (1/8)) = ½ ln 7 = 0.97
SLIDE 87

The Boosted Classifier Thus Far

h1(X) = wt(E1) > 0.45 ? +1 : −1
H(X) = sign(0.97 · h1(X)), which is the same as h1(X)
SLIDE 88

The AdaBoost Algorithm

(Algorithm repeated from slide 60; next step: reweight, Dt+1(xi) = Dt(xi) exp(−αt yi ht(xi)), and normalize.)
SLIDE 89

The Best Error

Dt+1(xi) = Dt(xi) exp(−αt yi ht(xi));  exp(α1) = exp(0.97) = 2.63 and exp(−α1) = 0.38.
Multiply the misclassified instance (A) by 2.63 and the correctly classified instances by 0.38:

ID    E1     E2     Class   Weight
A     0.3   −0.6    +1      1/8 × 2.63 = 0.33
B     0.5   −0.5    +1      1/8 × 0.38 = 0.05
C     0.7   −0.1    +1      1/8 × 0.38 = 0.05
D     0.6   −0.4    +1      1/8 × 0.38 = 0.05
E     0.2    0.4    −1      1/8 × 0.38 = 0.05
F    −0.8   −0.1    −1      1/8 × 0.38 = 0.05
G     0.4   −0.9    −1      1/8 × 0.38 = 0.05
H     0.2    0.5    −1      1/8 × 0.38 = 0.05

SLIDES 90–91

AdaBoost (illustration continued)
SLIDE 92

The AdaBoost Algorithm

(Algorithm repeated from slide 60; back to the top of the loop with the new distribution D2.)
SLIDE 93

The Best Error

D′ = D / Σ D: normalize the weights to sum to 1.0.

ID    E1     E2     Class   Weight (new → normalized)
A     0.3   −0.6    +1      0.33 → 0.48
B     0.5   −0.5    +1      0.05 → 0.074
C     0.7   −0.1    +1      0.05 → 0.074
D     0.6   −0.4    +1      0.05 → 0.074
E     0.2    0.4    −1      0.05 → 0.074
F    −0.8   −0.1    −1      0.05 → 0.074
G     0.4   −0.9    −1      0.05 → 0.074
H     0.2    0.5    −1      0.05 → 0.074
SLIDE 94

The Best Error

After normalization, the new distribution D2:

ID    E1     E2     Class   Weight
A     0.3   −0.6    +1      0.48
B     0.5   −0.5    +1      0.074
C     0.7   −0.1    +1      0.074
D     0.6   −0.4    +1      0.074
E     0.2    0.4    −1      0.074
F    −0.8   −0.1    −1      0.074
G     0.4   −0.9    −1      0.074
H     0.2    0.5    −1      0.074
SLIDE 95

The AdaBoost Algorithm

(Algorithm repeated from slide 60; second iteration: train a weak classifier h2 using distribution D2.)
SLIDE 96

E1 classifier

The same stump family on E1, now with weights A = 0.48 and 0.074 for the rest. At the threshold shown: sign = +1, error = 0.222; sign = −1, error = 0.778.
SLIDE 97

E1 classifier

At the next threshold: sign = +1, error = 0.148; sign = −1, error = 0.852.
SLIDE 98

The Best E1 classifier

The best E1 stump under the new weights: sign = +1, error = 0.074 (only G is misclassified).
SLIDE 99

The Best E2 classifier

The best E2 stump under the new weights: sign = −1, error = 0.148 (F and G are misclassified).
SLIDE 100

The Best Classifier

The best overall classifier in this round is again based on E1: if wt(E1) > 0.25, face = true (sign = +1, error = 0.074).
α2 = ½ ln((1 − 0.074) / 0.074) = 1.26
SLIDE 101

The Boosted Classifier Thus Far

h1(X) = wt(E1) > 0.45 ? +1 : −1
h2(X) = wt(E1) > 0.25 ? +1 : −1
H(X) = sign(0.97 · h1(X) + 1.26 · h2(X))
SLIDE 102

Reweighting the Data

exp(α2) = exp(1.26) = 3.5 and exp(−α2) = 0.28.
Multiply the correctly classified instances by 0.28 and the misclassified instance (G) by 3.5, then renormalize:

ID    E1     E2     Class   Weight
A     0.3   −0.6    +1      0.48 × 0.28 → 0.32
B     0.5   −0.5    +1      0.074 × 0.28 → 0.05
C     0.7   −0.1    +1      0.074 × 0.28 → 0.05
D     0.6   −0.4    +1      0.074 × 0.28 → 0.05
E     0.2    0.4    −1      0.074 × 0.28 → 0.05
F    −0.8   −0.1    −1      0.074 × 0.28 → 0.05
G     0.4   −0.9    −1      0.074 × 3.5 → 0.38
H     0.2    0.5    −1      0.074 × 0.28 → 0.05
SLIDE 103

Reweighting the Data

NOTE: the weight of “G”, which was misclassified by the second classifier, is now suddenly high.
SLIDE 104

AdaBoost

  • In this example both of our first two classifiers were based on E1
      – Additional classifiers may switch to E2
  • In general, the reweighting of the data will result in a different feature being picked for each classifier
  • This also automatically gives us a feature selection strategy
      – In this data, wt(E1) is the most important feature
SLIDE 105

AdaBoost

  • We are NOT required to go with the best classifier so far
  • For instance, for our second classifier we might use the best E2 classifier, even though it’s worse than the E1 classifier
      – So long as it’s right more than 50% of the time
  • We can continue to add classifiers even after we get 100% classification of the training data
      – Because the weights of the data keep changing
      – Adding new classifiers beyond this point is often a good thing to do
SLIDE 106

AdaBoost

  • The final classifier is
      – H(x) = sign(Σt αt ht(x))
  • The output is 1 if the total weight of all weak learners that classify x as 1 is greater than the total weight of all weak learners that classify it as −1

(Figure: a test image expressed as 0.4 E1 − 0.4 E2)
SLIDE 107

Boosting and Face Detection

  • Boosting is the basis of one of the most popular methods for face detection: the Viola-Jones algorithm
      – Current methods use other classifiers like SVMs, but AdaBoost classifiers remain easy to implement and popular
      – OpenCV implements Viola-Jones
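A minimal usage sketch of OpenCV’s implementation (the cascade file name is the stock frontal-face model shipped with the pip opencv-python package; the image path is a placeholder):

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("group_photo.jpg")            # any test image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # detector works on greyscale
faces = cascade.detectMultiScale(gray, scaleFactor=1.25, minNeighbors=3)
for (x, y, w, h) in faces:                     # one box per detected face
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```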

SLIDE 108

The problem of face detection

  • 1. Defining features
      – Should we be searching for noses, eyes, eyebrows etc.?
          • Nice, but expensive
      – Or something simpler?
  • 2. Selecting features
      – Of all the possible features we can think of, which ones make sense?
  • 3. Classification: combining evidence
      – How does one combine the evidence from the different features?
SLIDE 109

Features: The Viola Jones Method

  • Integral features!
      – Like the checkerboard
  • The same principle as we used to decompose images in terms of checkerboards:
      – The image of any object has changes at various scales
      – These can be represented coarsely by a checkerboard pattern
  • The checkerboard patterns must however now be localized
      – Stay within the region of the face

Image = w1·B1 + w2·B2 + w3·B3 + …   (the Bi are checkerboard basis patterns)
SLIDE 110

Features

  • Checkerboard patterns to represent facial features
      – The white areas are subtracted from the black ones
      – Each checkerboard explains a localized portion of the image
  • Four types of checkerboard patterns (only)
SLIDE 111

Explaining a portion of the face with a checker..

  • How much is the difference in average intensity of the image between the black and white regions?
      – Sum(pixel values in white region) − Sum(pixel values in black region)
  • This is actually the dot product of the region of the face covered by the rectangle and the checkered pattern itself
      – White = 1, Black = −1
SLIDE 112

“Integral” features

  • Each checkerboard has the following characteristics
      – Length
      – Width
      – Type
          • Specifies the number and arrangement of bands
  • The four checkerboards above are the four used by Viola and Jones
SLIDE 113

Integral images

  • Summed-area tables
  • For each pixel, store the sum of ALL pixels to the left of and above it
SLIDE 114

Fast Computation of Pixel Sums

  • To compute the sum of the pixels within region “D”:
      – Pixelsum(1) = Area(A)
      – Pixelsum(2) = Area(A) + Area(B)
      – Pixelsum(3) = Area(A) + Area(C)
      – Pixelsum(4) = Area(A) + Area(B) + Area(C) + Area(D)
  • Area(D) = Pixelsum(4) − Pixelsum(2) − Pixelsum(3) + Pixelsum(1)

(Figure: four quadrants A, B, C, D with corner points 1, 2, 3, 4)
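A minimal sketch of the integral image and the four-corner box-sum trick (the zero padding is an implementation convenience, not from the slides):

```python
import numpy as np

def integral_image(img):
    # ii[r, c] = sum of img[:r, :c]; zero-padded so corner lookups
    # near the image border need no special cases.
    return np.pad(img.cumsum(0).cumsum(1), ((1, 0), (1, 0)))

def box_sum(ii, top, left, bottom, right):
    # Sum of img[top:bottom, left:right] in constant time.
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

img = np.arange(36).reshape(6, 6)
ii = integral_image(img)
assert box_sum(ii, 2, 1, 5, 4) == img[2:5, 1:4].sum()
```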

SLIDE 115
A Fast Way to Compute the Feature

  • Store the pixel-table (integral image) value for every pixel in the image
      – The sum of all pixel values to the left of and above the pixel
  • Let A, B, C, D, E, F be the pixel-table values at the locations shown
      – Total pixel value of the black area = D + A − B − C
      – Total pixel value of the white area = F + C − D − E
      – Feature value = (F + C − D − E) − (D + A − B − C)
SLIDE 116

How many features?

  • In an M×N image, each checkerboard of width P and height H can start at any of (N−P)(M−H) pixels
  • (M−H)·(N−P) possible starting locations
      – Each is a unique checker feature
      – E.g. at one location it may measure the forehead, at another the chin
SLIDE 117

How many features

  • Each feature can have many sizes
      – Width from (min) to (max) pixels
      – Height from (min ht) to (max ht) pixels
  • At each size, there can be many starting locations
      – Total number of possible checkerboards of one type: no. of possible sizes × no. of possible locations
  • There are four types of checkerboards
      – Total no. of possible checkerboards: VERY VERY LARGE!
SLIDE 118

Learning: No. of features

  • Analysis performed on images of 24×24 pixels only
      – Reduces the no. of possible features to about 180,000
  • Restrict checkerboard size
      – Minimum of 8 pixels wide
      – Minimum of 8 pixels high
          • Other limits, e.g. 4 pixels, may be used too
      – Reduces the no. of checkerboards to about 50,000
SLIDE 119
No. of features

  • Each possible checkerboard gives us one feature
  • A total of up to 180,000 features derived from a 24×24 image!
  • Every 24×24 image is now represented by a set of 180,000 numbers
      – This is the set of features we will use for classifying whether it is a face or not!

(Figure: each image becomes a vector of feature values F1, F2, F3, …, F180000)
SLIDE 120

The Classifier

  • The Viola-Jones algorithm uses AdaBoost with “stumps”
  • At each stage, find the best feature to classify the data with
      – I.e. the feature that gives us the best classification of all the training data
          • The training data includes many examples of faces and of non-face images
      – The classification rule is of the kind
          • If feature > threshold, face (or: if feature < threshold, face)
          • The optimal value of “threshold” must also be determined
SLIDE 121

To Train

  • Collect a large number of facial images
      – Resize all of them to 24×24
      – These are our “face” training set
  • Collect a much, much larger set of 24×24 non-face images of all kinds
      – These are our “non-face” training set
  • Train a boosted classifier
SLIDE 122

The Viola-Jones Classifier

  • During testing:
      – Given any new 24×24 image, evaluate R = Σf αf · (f > pf θf)
      – Only a small number of features (< 100) is typically used
  • Problems:
      – It only classifies 24×24 images entirely as faces or non-faces
          • Pictures are typically much larger
          • They may contain many faces
          • Faces in pictures can be much larger or smaller
      – Not accurate enough
SLIDE 123

Multiple faces in the picture (slides 123–126)

  • Scan the image
      – Classify each 24×24 rectangle from the photo
      – All rectangles that get classified as having a face indicate the location of a face
  • For an N×M picture we will perform (N−24)·(M−24) classifications
  • If overlapping 24×24 rectangles are found to have faces, merge them
SLIDE 127

Picture size solution

  • We already have a classifier
      – That uses weak learners
  • Scale the picture
      – Scale the picture down by a factor a
      – Keep decrementing down to a minimum reasonable size
SLIDE 128

False Rejection vs. False Detection

  • False rejection: there’s a face in the image, but the classifier misses it
      – It rejects the hypothesis that there’s a face
  • False detection: recognizes a face when there is none
  • Classifier:
      – Standard boosted classifier: H(x) = sign(Σt αt ht(x))
      – Modified classifier: H(x) = sign(Σt αt ht(x) + Y)
  • Σt αt ht(x) is a measure of certainty
      – The higher it is, the more certain we are that we found a face
  • If Y is large, we assume the presence of a face even when we are not sure
      – By increasing Y, we can reduce false rejection while increasing false detection
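A tiny sketch of the biased decision rule (here `score` stands for Σt αt ht(x), assumed already computed):

```python
import numpy as np

def classify(score, Y=0.0):
    # Larger Y: fewer missed faces, more false detections.
    return np.sign(score + Y)
```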

SLIDE 129

ROC

  • Ideally false rejection will be 0% and false detection will also be 0%
  • As Y increases, we reject faces less and less
      – But accept increasing amounts of garbage as faces
  • Can set Y so that we rarely miss a face

(Figure: ROC curve of % false detection vs. % false rejection; the operating point moves as Y increases)
SLIDE 130

Problem: Not accurate enough, too slow (slides 130–131)

  • If we set Y high enough, we will never miss a face
      – But we will classify a lot of junk as faces
  • Solution: classify the output of the first classifier with a second classifier
      – And so on

(Figure: Classifier 1 rejects “not a face” or passes the window to Classifier 2, which does the same, and so on)
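A minimal sketch of the resulting cascade (the stage representation is an assumption; each stage is a boosted classifier with a generous bias Y so real faces survive while most junk is rejected early and cheaply):

```python
def cascade(window, stages):
    # stages: list of (score_fn, Y), where score_fn(window) plays the
    # role of Σt αt ht(window) for that stage's boosted classifier.
    for score_fn, Y in stages:
        if score_fn(window) + Y < 0:
            return False            # rejected early: "not a face"
    return True                     # survived every stage: report a face
```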
SLIDE 132

Useful Features Learned by Boosting

SLIDE 133

A Cascade of Classifiers

SLIDE 134

Detection in Real Images

  • The basic classifier operates on 24×24 subwindows
  • Scaling:
      – Scale the detector (rather than the images)
      – Features can easily be evaluated at any scale
      – Scale by factors of 1.25
  • Location:
      – Move the detector around the image (e.g., in 1-pixel increments)
  • Final detections
      – A real face may result in multiple nearby detections
      – Postprocess detected subwindows to combine overlapping detections into a single detection
SLIDE 135

Training

  • In the paper: 24×24 images of faces and non-faces (positive and negative examples)
SLIDE 136

Sample results using the Viola-Jones Detector

  • Notice detection at multiple scales

SLIDE 137

More Detection Examples

SLIDE 138

Practical implementation

  • Details discussed in the Viola-Jones paper
  • Training time: weeks (with 5k faces and 9.5k non-faces)
  • The final detector has 38 layers in the cascade, 6060 features
  • On a 700 MHz processor:
      – Can process a 384 × 288 image in 0.067 seconds (in 2003, when the paper was written)
SLIDES 139–141

Best Window/Background Issues
SLIDE 142

Key Ideas

  • Eigenface features
  • Sliding windows & scale-space pyramid
  • Boosting an ensemble of weak classifiers
  • Integral image / Haar features
  • Cascaded strong classifiers