SLIDE 1

CS109A Introduction to Data Science

Pavlos Protopapas and Kevin Rader

Lecture 12: Perceptron and Back Propagation

SLIDE 2

Outline

  • 1. Review of Classification and Logistic Regression
  • 2. Introduction to Optimization: Gradient Descent – Stochastic Gradient Descent
  • 3. Single Neuron Network (‘Perceptron’)
  • 4. Multi-Layer Perceptron
  • 5. Back Propagation

SLIDE 4

Classification and Logistic Regression

SLIDE 5

Classification

Methods centered around modeling and prediction of a quantitative response variable (e.g., number of taxi pickups, number of bike rentals) are called regression methods (including Ridge, LASSO, etc.). When the response variable is categorical, the problem is no longer called a regression problem but is instead labeled a classification problem. The goal is to classify each observation into a category (aka class or cluster) defined by Y, based on a set of predictor variables X.

SLIDE 6

Heart Data

Age  Sex  ChestPain     RestBP  Chol  Fbs  RestECG  MaxHR  ExAng  Oldpeak  Slope  Ca   Thal        AHD
63   1    typical       145     233   1    2        150    0      2.3      3      0.0  fixed       No
67   1    asymptomatic  160     286   0    2        108    1      1.5      2      3.0  normal      Yes
67   1    asymptomatic  120     229   0    2        129    1      2.6      2      2.0  reversable  Yes
37   1    nonanginal    130     250   0    0        187    0      3.5      3      0.0  normal      No
41   0    nontypical    130     204   0    2        172    0      1.4      1      0.0  normal      No

The response variable Y (AHD) is Yes/No.

SLIDE 7

Heart Data: logistic estimation

We'd like to predict whether or not a person has heart disease, and we'd like to make this prediction, for now, based only on MaxHR.

SLIDE 8

Logistic Regression

Logistic regression addresses the problem of estimating a probability, $P(Y=1)$, given an input $X$. The logistic regression model uses a function, called the logistic function, to model $P(Y=1)$:

$$P(Y = 1) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$$
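As a quick illustration (a sketch, not from the slides; the coefficient values are made up), the logistic function can be computed directly in NumPy:

import numpy as np

def logistic_prob(X, beta0, beta1):
    # P(Y = 1 | X) under the logistic regression model
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * X)))

# e.g., probability of heart disease at MaxHR = 150 with illustrative
# coefficients beta0 = 6.3, beta1 = -0.043
print(logistic_prob(150, 6.3, -0.043))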

SLIDE 9

Logistic Regression

As a result the model will predict $P(Y=1)$ with an S-shaped curve, which is the general shape of the logistic function.

$\beta_0$ shifts the curve right or left by $c = -\frac{\beta_0}{\beta_1}$.

$\beta_1$ controls how steep the S-shaped curve is; the distance from ½ to ~1 (or from ½ to ~0) is $\frac{2}{\beta_1}$.

Note: if $\beta_1$ is positive, then the predicted $P(Y=1)$ goes from zero for small values of $X$ to one for large values of $X$, and if $\beta_1$ is negative, then $P(Y=1)$ has the opposite association.

SLIDE 10

Logistic Regression

[Figure: the logistic curve, annotated with the shift $c = -\beta_0/\beta_1$ and the steepness scale set by $\beta_1$]

SLIDE 11

Logistic Regression

$$P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$$

SLIDE 12

Logistic Regression

$$P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$$

SLIDE 13

Estimating the coefficients for Logistic Regression

Find the coefficients that minimize the loss function

$$\mathcal{L}(\beta_0, \beta_1) = -\sum_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$
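This loss (the negative log-likelihood, i.e. binary cross-entropy) translates directly to NumPy; this is a sketch, with a clipping guard added to avoid log(0):

import numpy as np

def bce_loss(y, p, eps=1e-12):
    # L(beta0, beta1) = -sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))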

SLIDE 14

But what is the idea?

Start with Regression or Logistic Regression:

$$Y = f(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4)$$

$\beta_1, \dots, \beta_4$ are the coefficients or weights; $\beta_0$ is the intercept or bias.

Classification: $f(X) = \frac{1}{1 + e^{-W^T X}}$.  Regression: $f(X) = W^T X$,

where $W = (W_0, W_1, \dots, W_4) = [\beta_0, \beta_1, \dots, \beta_4]$.

SLIDE 15

But what is the idea?

Start with all randomly selected weights. Most likely it will perform horribly. For example, on our heart data the model will give us the wrong answer.

Input: MaxHR = 200, Age = 52, Sex = Male, Chol = 152
Model output: $\hat{p} = 0.9 \Rightarrow$ Yes (bad computer!)
Correct answer: y = No

SLIDE 16

But what is the idea?

Start with all randomly selected weights. Most likely it will perform horribly. For example, on our heart data the model will give us the wrong answer.

Input: MaxHR = 170, Age = 42, Sex = Male, Chol = 342
Model output: $\hat{p} = 0.4 \Rightarrow$ No (bad computer!)
Correct answer: y = Yes

SLIDE 17

But what is the idea?

  • Loss function: takes all of these results, averages them, and tells us how bad or good the computer (those weights) is.
  • Telling the computer how bad or good it is does not help.
  • You want to tell it how to change those weights so it gets better.

Loss function: $\mathcal{L}(w_0, w_1, w_2, w_3, w_4)$. For now, let's only consider one weight, $\mathcal{L}(w_1)$.

SLIDE 18

Minimizing the Loss function

Ideally we want to know the value of $w_1$ that gives the minimum of $\mathcal{L}(W)$. To find the optimal point of a function $\mathcal{L}(W)$, we solve

$$\frac{d\mathcal{L}(W)}{dW} = 0$$

and find the $W$ that satisfies that equation. Sometimes there is no explicit solution for that.

SLIDE 19

Minimizing the Loss function

A more flexible method is:

  • Start from any point
  • Determine which direction to go to reduce the loss (left or right)
  • Specifically, we can calculate the slope of the function at this point
  • Shift to the right if slope is negative or shift to the left if slope is positive
  • Repeat
SLIDE 20

Minimization of the Loss Function

If the step is proportional to the slope then you avoid overshooting the minimum.

Question: What is the mathematical function that describes the slope?
Question: How do we generalize this to more than one predictor?
Question: What do you think is a good approach for telling the model how to change (what is the step size) to become better?

SLIDE 21

Minimization of the Loss Function

If the step is proportional to the slope then you avoid overshooting the minimum.

Question: What is the mathematical function that describes the slope? The derivative.
Question: How do we generalize this to more than one predictor? Take the derivative with respect to each coefficient and do the same sequentially.
Question: What do you think is a good approach for telling the model how to change (what is the step size) to become better? More on this later.

SLIDE 22

Let’s play the Pavlos game

We know that we want to go in the opposite direction of the derivative, and we know we want to be making a step proportional to the derivative.

Making a step means: $w_{\text{new}} = w_{\text{old}} + \text{step}$

Opposite direction of the derivative means: $w_{\text{new}} = w_{\text{old}} - \lambda \frac{d\mathcal{L}}{dw}$

Changing to more conventional notation: $w^{(i+1)} = w^{(i)} - \lambda \frac{d\mathcal{L}}{dw}$

where $\lambda$ is the learning rate.
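A minimal sketch of this update rule (the loss here is a toy one-dimensional example, not the logistic loss):

def gradient_descent(dL_dw, w0, lam=0.1, n_steps=100):
    # iterate w <- w - lam * dL/dw starting from w0
    w = w0
    for _ in range(n_steps):
        w = w - lam * dL_dw(w)
    return w

# example on L(w) = (w - 2)**2, whose derivative is 2(w - 2); minimum at w = 2
print(gradient_descent(lambda w: 2 * (w - 2), w0=0.0))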

SLIDE 23

Gradient Descent

  • Gradient descent is a first-order optimization algorithm for finding a minimum of a function.
  • It is an iterative method.
  • $\mathcal{L}$ decreases in the direction of the negative derivative.
  • The learning rate is controlled by the magnitude of $\lambda$.

[Figure: loss $\mathcal{L}(w)$ as a function of a weight $w$, with steps following the negative slope toward the minimum]

$$w^{(i+1)} = w^{(i)} - \lambda \frac{d\mathcal{L}}{dw}$$

SLIDE 24

Considerations

  • We still need to derive the derivatives.
  • We need to know what the learning rate is, or how to set it.
  • We need to avoid local minima.
  • Finally, the full likelihood function includes summing up all individual ‘errors’. Unless you are a statistician, this can be hundreds of thousands of examples.

SLIDE 26

Derivatives: Memories from middle school

SLIDE 27

Linear Regression

Minimize the sum of squared residuals:

$$f = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2$$

$$\frac{df}{d\beta_0} = 0 \Rightarrow -2\sum_i (y_i - \beta_0 - \beta_1 x_i) = 0 \Rightarrow \sum_i y_i - \beta_0 n - \beta_1 \sum_i x_i = 0 \Rightarrow \beta_0 = \bar{y} - \beta_1 \bar{x}$$

$$\frac{df}{d\beta_1} = 0 \Rightarrow 2\sum_i (y_i - \beta_0 - \beta_1 x_i)(-x_i) = 0 \Rightarrow -\sum_i x_i y_i + \beta_0 \sum_i x_i + \beta_1 \sum_i x_i^2 = 0$$

Substituting $\beta_0 = \bar{y} - \beta_1 \bar{x}$:

$$-\sum_i x_i y_i + (\bar{y} - \beta_1 \bar{x}) \sum_i x_i + \beta_1 \sum_i x_i^2 = 0$$

$$\beta_1 \left( \sum_i x_i^2 - n\bar{x}^2 \right) = \sum_i x_i y_i - n\bar{x}\bar{y} \;\Rightarrow\; \beta_1 = \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{\sum_i x_i^2 - n\bar{x}^2} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$
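These closed-form expressions translate directly to NumPy (a sketch; the data values are made up):

import numpy as np

def ols_fit(x, y):
    # closed-form simple linear regression coefficients
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0 = y.mean() - beta1 * x.mean()
    return beta0, beta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
print(ols_fit(x, y))  # beta0 ~ 0.15, beta1 ~ 1.94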

SLIDE 28

Logistic Regression Derivatives


Can we do it? Wolfram Alpha can do it for us! We need a formalism to deal with these derivatives.

SLIDE 29

Chain Rule

  • Chain rule for computing gradients: for $y = g(x)$ and $z = f(y) = f(g(x))$,

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$$

  • For vector-valued intermediates, $\mathbf{y} = g(\mathbf{x})$ and $z = f(\mathbf{y}) = f(g(\mathbf{x}))$:

$$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}$$

  • For longer chains:

$$\frac{\partial z}{\partial x_i} = \sum_{j_1} \cdots \sum_{j_m} \frac{\partial z}{\partial y_{j_1}} \cdots \frac{\partial y_{j_m}}{\partial x_i}$$
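A quick numeric sanity check of the chain rule (toy functions, not from the slides): for $z = \sin(x^2)$ the chain rule gives $dz/dx = \cos(x^2) \cdot 2x$, which a finite difference should match.

import numpy as np

x = 1.3
analytic = np.cos(x ** 2) * 2 * x                            # chain rule
numeric = (np.sin((x + 1e-6) ** 2) - np.sin(x ** 2)) / 1e-6  # finite difference
print(analytic, numeric)  # the two values agree closely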

SLIDE 30

Logistic Regression derivatives

For logistic regression, the negative log of the likelihood is:

$$\mathcal{L} = \sum_i \mathcal{L}_i = -\sum_i \log L_i = -\sum_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$

$$\mathcal{L}_i = -y_i \log \frac{1}{1 + e^{-W^T X}} - (1 - y_i) \log\left(1 - \frac{1}{1 + e^{-W^T X}}\right)$$

To simplify the analysis, let us split it into two parts: $\mathcal{L}_i = \mathcal{L}_i^A + \mathcal{L}_i^B$. So the derivative with respect to $W$ is:

$$\frac{\partial \mathcal{L}}{\partial W} = \sum_i \frac{\partial \mathcal{L}_i}{\partial W} = \sum_i \left( \frac{\partial \mathcal{L}_i^A}{\partial W} + \frac{\partial \mathcal{L}_i^B}{\partial W} \right)$$

SLIDE 31

Chain of intermediate variables for $\mathcal{L}_i^A = -y_i \log \frac{1}{1 + e^{-W^T X}}$:

| Variable | Partial derivative | Evaluated |
| $\xi_1 = -W^T X$ | $\partial\xi_1/\partial W = -X$ | $-X$ |
| $\xi_2 = e^{\xi_1} = e^{-W^T X}$ | $\partial\xi_2/\partial\xi_1 = e^{\xi_1}$ | $e^{-W^T X}$ |
| $\xi_3 = 1 + \xi_2 = 1 + e^{-W^T X}$ | $\partial\xi_3/\partial\xi_2 = 1$ | $1$ |
| $\xi_4 = 1/\xi_3 = \frac{1}{1 + e^{-W^T X}} = p$ | $\partial\xi_4/\partial\xi_3 = -1/\xi_3^2$ | $-\left(\frac{1}{1 + e^{-W^T X}}\right)^2$ |
| $\xi_5 = \log \xi_4 = \log p$ | $\partial\xi_5/\partial\xi_4 = 1/\xi_4$ | $1 + e^{-W^T X}$ |
| $\mathcal{L}_i^A = -y\,\xi_5$ | $\partial\mathcal{L}_i^A/\partial\xi_5 = -y$ | $-y$ |

By the chain rule:

$$\frac{\partial \mathcal{L}_i^A}{\partial W} = \frac{\partial \mathcal{L}_i^A}{\partial \xi_5} \frac{\partial \xi_5}{\partial \xi_4} \frac{\partial \xi_4}{\partial \xi_3} \frac{\partial \xi_3}{\partial \xi_2} \frac{\partial \xi_2}{\partial \xi_1} \frac{\partial \xi_1}{\partial W} = -y\,X\,\frac{e^{-W^T X}}{1 + e^{-W^T X}}$$

SLIDE 32

Chain for $\mathcal{L}_i^B = -(1 - y_i) \log\left[1 - \frac{1}{1 + e^{-W^T X}}\right]$, with $\xi_1, \dots, \xi_4$ as before:

| Variable | Partial derivative | Evaluated |
| $\xi_5 = 1 - \xi_4 = 1 - \frac{1}{1 + e^{-W^T X}}$ | $\partial\xi_5/\partial\xi_4 = -1$ | $-1$ |
| $\xi_6 = \log \xi_5 = \log(1 - p)$ | $\partial\xi_6/\partial\xi_5 = 1/\xi_5$ | $\frac{1 + e^{-W^T X}}{e^{-W^T X}}$ |
| $\mathcal{L}_i^B = -(1 - y)\,\xi_6$ | $\partial\mathcal{L}_i^B/\partial\xi_6 = -(1 - y)$ | $-(1 - y)$ |

By the chain rule:

$$\frac{\partial \mathcal{L}_i^B}{\partial W} = \frac{\partial \mathcal{L}_i^B}{\partial \xi_6} \frac{\partial \xi_6}{\partial \xi_5} \frac{\partial \xi_5}{\partial \xi_4} \frac{\partial \xi_4}{\partial \xi_3} \frac{\partial \xi_3}{\partial \xi_2} \frac{\partial \xi_2}{\partial \xi_1} \frac{\partial \xi_1}{\partial W} = (1 - y)\,X\,\frac{1}{1 + e^{-W^T X}}$$

SLIDE 33

Learning Rate

SLIDE 34

Learning Rate

Trial and error. There are many alternative methods which address how to set or adjust the learning rate, using the derivative or second derivatives and/or the momentum. To be discussed in the next lectures on NN.

  • J. Nocedal and S. Wright, “Numerical Optimization”, Springer, 1999.
  • TL;DR: J. Bullinaria, “Learning with Momentum, Conjugate Gradient Learning”, 2015.

SLIDE 35

Local and Global minima

SLIDE 36

Local vs Global Minima

[Figure: loss curve $\mathcal{L}(\theta)$ with a local minimum and a global minimum]

SLIDE 38
Local vs Global Minima

No guarantee that we get the global minimum. Question: What would be a good strategy?

SLIDE 39

Large data

SLIDE 40

Batch and Stochastic Gradient Descent

Instead of using all the examples for every step, use a subset of them (a batch).

The full loss function is

$$\mathcal{L} = -\sum_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$

For each iteration $k$, use the following loss function to derive the derivatives:

$$\mathcal{L}^k = -\sum_{i \in b_k} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$

which is an approximation to the full loss function ($b_k$ is the set of indices in batch $k$).
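A minimal mini-batch SGD sketch for this loss (illustrative, assuming X holds one row per observation and no intercept; the gradient of the batch loss is $X_b^T (p - y_b)$):

import numpy as np

def sgd_logistic(X, y, lam=0.1, batch_size=32, n_epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(n_epochs):
        idx = rng.permutation(n)                  # shuffle the observations
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]     # indices of batch k
            p = 1.0 / (1.0 + np.exp(-X[b] @ W))   # predictions on the batch
            W -= lam * (X[b].T @ (p - y[b]))      # step along the batch gradient
    return W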
SLIDE 41-49

Batch and Stochastic Gradient Descent

[Figure, animated across these slides: the loss $\mathcal{L}(\theta)$ under the full likelihood versus successive batch likelihoods; each batch yields a slightly different approximation to the full loss surface, so each step follows a slightly different curve.]

SLIDE 50

Artificial Neural Networks (ANN)

SLIDE 51

Logistic Regression Revisited

Logistic regression can be drawn as a pipeline with three stages per observation $x_i$:

Affine: $h_i = \beta_0 + \beta_1 x_i$
Activation: $p_i = \frac{1}{1 + e^{-h_i}}$
Loss: $\mathcal{L}_i(\beta) = -y_i \ln p_i - (1 - y_i) \ln(1 - p_i)$

and the total loss is $\mathcal{L}(\beta) = \sum_i \mathcal{L}_i(\beta)$.
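A minimal sketch of this single-neuron pipeline in NumPy (function names are illustrative, not from the slides):

import numpy as np

def neuron_forward(x, beta0, beta1):
    h = beta0 + beta1 * x            # affine
    return 1.0 / (1.0 + np.exp(-h))  # activation (sigmoid)

def neuron_loss(y, p):
    # loss for one observation
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))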

SLIDE 52

Build our first ANN

The same pipeline, written progressively more generally:

Affine: $h = \beta_0 + \beta_1 x$ → Activation: $p = \frac{1}{1 + e^{-h}}$ → Loss: $\mathcal{L}(\beta) = \sum_i \mathcal{L}_i(\beta)$

Affine: $h = \beta^T X$ → Activation: $p = \frac{1}{1 + e^{-h}}$ → Loss: $\mathcal{L}(\beta) = \sum_i \mathcal{L}_i(\beta)$

Affine: $h = W^T X$ → Activation: $p = \frac{1}{1 + e^{-h}}$ → Loss: $\mathcal{L}(W) = \sum_i \mathcal{L}_i(W)$

SLIDE 53

Build our first ANN

Affine: $h = W^T X$ → Activation: $p = \frac{1}{1 + e^{-h}}$ → Loss: $\mathcal{L}(W) = \sum_i \mathcal{L}_i(W)$

[Diagram: input $X$ feeding a single neuron that outputs $y$]

SLIDE 54

Example Using Heart Data


Slightly modified data to illustrate a concept.

SLIDE 55

Example Using Heart Data

[Figure: the heart data with the single-neuron (logistic) fit, $X$ vs $y$]

SLIDE 56

Example

[Figure: two panels comparing model output $y'$ and data $y$ as functions of $X$]

SLIDE 57

Pavlos game #232

[Diagram: two single-neuron networks, with weights W1 and W2, each take $X$ as input and produce $h_1$ and $h_2$; their outputs are summed to give $h_1 + h_2$]

SLIDE 58

Pavlos game #232

[Diagram: the outputs $h_1, h_2$ of the two neurons (weights W1, W2) are combined by a third affine node with weights W3]

$$r = W_{3,1} h_1 + W_{3,2} h_2 + W_{3,0}$$

SLIDE 59

Pavlos game #232

[Diagram: the same network; the combining node is followed by a sigmoid activation and the loss]

$$r = W_{3,1} h_1 + W_{3,2} h_2 + W_{3,0}, \qquad p = \frac{1}{1 + e^{-r}}, \qquad \mathcal{L} = -y \ln p - (1 - y) \ln(1 - p)$$

We need to learn W1, W2 and W3.
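A sketch of this small network in NumPy, assuming each hidden node is itself a sigmoid neuron and using an illustrative weight layout (W1 = $(w_{1,0}, w_{1,1})$, W2 likewise, W3 = $(w_{3,0}, w_{3,1}, w_{3,2})$):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mlp_forward(x, W1, W2, W3):
    h1 = sigmoid(W1[0] + W1[1] * x)      # first hidden neuron
    h2 = sigmoid(W2[0] + W2[1] * x)      # second hidden neuron
    r = W3[0] + W3[1] * h1 + W3[2] * h2  # affine combination of h1, h2
    return sigmoid(r)                    # p = P(y = 1 | x)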

SLIDE 60

Backpropagation

SLIDE 61

Backpropagation: Logistic Regression Revisited

The pipeline: Affine $h = \beta_0 + \beta_1 X$ → Activation $p = \frac{1}{1 + e^{-h}}$ → Loss $\mathcal{L}(\beta) = \sum_i \mathcal{L}_i(\beta)$.

Applying the chain rule, $\frac{\partial\mathcal{L}}{\partial\beta_1} = \frac{\partial\mathcal{L}}{\partial p}\frac{\partial p}{\partial h}\frac{\partial h}{\partial\beta_1}$ and $\frac{\partial\mathcal{L}}{\partial\beta_0} = \frac{\partial\mathcal{L}}{\partial p}\frac{\partial p}{\partial h}\frac{\partial h}{\partial\beta_0}$, with

$$\frac{\partial\mathcal{L}}{\partial p} = -y\frac{1}{p} + (1 - y)\frac{1}{1 - p}, \qquad \frac{\partial p}{\partial h} = \sigma(h)(1 - \sigma(h)), \qquad \frac{\partial h}{\partial\beta_1} = X, \quad \frac{\partial h}{\partial\beta_0} = 1$$

$$\frac{\partial\mathcal{L}}{\partial\beta_1} = -X\,\sigma(h)(1 - \sigma(h))\left[y\frac{1}{p} - (1 - y)\frac{1}{1 - p}\right], \qquad \frac{\partial\mathcal{L}}{\partial\beta_0} = -\sigma(h)(1 - \sigma(h))\left[y\frac{1}{p} - (1 - y)\frac{1}{1 - p}\right]$$
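These expressions translate directly into a gradient function (a sketch; note that the product $\frac{\partial\mathcal{L}}{\partial p}\frac{\partial p}{\partial h}$ simplifies to $p - y$):

import numpy as np

def grad_logistic(x, y, beta0, beta1):
    # per-observation gradients of the loss wrt beta0 and beta1
    h = beta0 + beta1 * x
    p = 1.0 / (1.0 + np.exp(-h))
    dL_dp = -y / p + (1 - y) / (1 - p)
    dp_dh = p * (1 - p)            # sigma(h) * (1 - sigma(h))
    dL_dh = dL_dp * dp_dh          # simplifies to p - y
    return dL_dh * 1.0, dL_dh * x  # (dL/dbeta0, dL/dbeta1)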

SLIDE 62

Backpropagation

  • 1. Derivatives need to be evaluated at some values of X, y and W.
  • 2. But since we have an expression, we can build a function that takes X, y, W as input and returns the derivatives, and then we can use gradient descent to update.
  • 3. This approach works well but it does not generalize: if the network is changed, we need to write a new function to evaluate the derivatives. For example, this network will need a different function for the derivatives:

[Diagram: a network where the input $X$ feeds two neurons with weights W1 and W2, whose outputs feed a node with weights W3 producing $y$]

SLIDE 63

Backpropagation

(The same points, illustrated with a larger network.)

[Diagram: the previous network extended with additional nodes and weights (W4, …); each new architecture would need yet another hand-derived derivative function]

SLIDE 64

Backpropagation: Pavlos game #456

We need to find a formalism to calculate the derivatives of the loss with respect to the weights that is:

  • 1. Flexible enough that adding a node or a layer or changing something in the network won’t require re-deriving the functional form from scratch.
  • 2. Exact.
  • 3. Computationally efficient.

Hints:

  • 1. Remember, we only need to evaluate the derivatives at $X_i$, $y_i$ and $W^{(k)}$.
  • 2. We should take advantage of the chain rule we learned before.
SLIDE 65

Idea 1: Evaluate the derivative at: X={3}, y=1, W=3

Evaluating the chain for $\mathcal{L}_i^A$ at X = 3, y = 1, W = 3:

| Variable | Partial derivative | Value of the variable | Value of the partial derivative |
| $\xi_1 = -W^T X$ | $\partial\xi_1/\partial W = -X$ | $-9$ | $-3$ |
| $\xi_2 = e^{\xi_1}$ | $\partial\xi_2/\partial\xi_1 = e^{\xi_1}$ | $e^{-9}$ | $e^{-9}$ |
| $\xi_3 = 1 + \xi_2$ | $\partial\xi_3/\partial\xi_2 = 1$ | $1 + e^{-9}$ | $1$ |
| $\xi_4 = 1/\xi_3 = p$ | $\partial\xi_4/\partial\xi_3 = -1/\xi_3^2$ | $\frac{1}{1 + e^{-9}}$ | $-\frac{1}{(1 + e^{-9})^2}$ |
| $\xi_5 = \log \xi_4$ | $\partial\xi_5/\partial\xi_4 = 1/\xi_4$ | $\log\frac{1}{1 + e^{-9}}$ | $1 + e^{-9}$ |
| $\mathcal{L}_i^A = -y\,\xi_5$ | $\partial\mathcal{L}_i^A/\partial\xi_5 = -y$ | $-\log\frac{1}{1 + e^{-9}}$ | $-1$ |

Multiplying the partial-derivative column down the chain:

$$\frac{\partial\mathcal{L}_i^A}{\partial W} = \frac{\partial\mathcal{L}_i^A}{\partial\xi_5} \frac{\partial\xi_5}{\partial\xi_4} \frac{\partial\xi_4}{\partial\xi_3} \frac{\partial\xi_3}{\partial\xi_2} \frac{\partial\xi_2}{\partial\xi_1} \frac{\partial\xi_1}{\partial W} = -3\,\frac{e^{-9}}{1 + e^{-9}} \approx -0.00037018372$$

SLIDE 66

Basic functions

We still need to derive the derivatives of $\mathcal{L}$ ourselves. (The table from the previous slide is repeated here.)

SLIDE 67

Basic functions

Notice, though, that these are basic functions that my grandparent can do (assuming numpy has been imported as np):

$\xi_0 = X$,  $\partial\xi_0/\partial X = 1$:

def x0(x): return x
def derx0(): return 1

$\xi_1 = -W^T \xi_0$,  $\partial\xi_1/\partial W = -X$:

def x1(a, x): return -a*x    # a plays the role of the data X, x of the weight W
def derx1(a, x): return -a

$\xi_2 = e^{\xi_1}$,  $\partial\xi_2/\partial\xi_1 = e^{\xi_1}$:

def x2(x): return np.exp(x)
def derx2(x): return np.exp(x)

$\xi_3 = 1 + \xi_2$,  $\partial\xi_3/\partial\xi_2 = 1$:

def x3(x): return 1 + x
def derx3(x): return 1

$\xi_4 = 1/\xi_3$,  $\partial\xi_4/\partial\xi_3 = -1/\xi_3^2$:

def x4(x): return 1/x
def derx4(x): return -(1/x)**2

$\xi_5 = \log \xi_4$,  $\partial\xi_5/\partial\xi_4 = 1/\xi_4$:

def x5(x): return np.log(x)
def derx5(x): return 1/x

$\mathcal{L}_i^A = -y\,\xi_5$,  $\partial\mathcal{L}_i^A/\partial\xi_5 = -y$:

def lossA(y, x): return -y*x
def derL(y): return -y
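As a quick check (not on the original slide), chaining these functions in forward mode, accumulating the derivative alongside each evaluation, reproduces the number from the earlier table:

X, y, W = 3.0, 1.0, 3.0
xi1, d1 = x1(X, W), derx1(X, W)  # xi1 = -9, d(xi1)/dW = -3
xi2, d2 = x2(xi1), derx2(xi1) * d1
xi3, d3 = x3(xi2), derx3(xi2) * d2
xi4, d4 = x4(xi3), derx4(xi3) * d3
xi5, d5 = x5(xi4), derx5(xi4) * d4
L_A, dLA_dW = lossA(y, xi5), derL(y) * d5
print(dLA_dW)  # ~ -0.00037018372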

SLIDE 68

Putting it altogether

  • 1. We specify the network structure.

[Diagram: the network with weights W1, W2 and W3]

  • 2. We create the computational graph …

What is a computational graph?

SLIDE 69

Computational Graph

The computational graph for the logistic loss chains elementary operations, starting from the inputs X and W:

$\times$:  $\xi_1 = -W^T X$
$\exp$:  $\xi_2 = e^{-W^T X}$
$+1$:  $\xi_3 = 1 + e^{-W^T X}$
$\div$:  $\xi_4 = \frac{1}{1 + e^{-W^T X}}$
$\log$:  $\xi_5 = \log \frac{1}{1 + e^{-W^T X}}$;  $1 - (\cdot)$:  $\xi_6 = 1 - \frac{1}{1 + e^{-W^T X}}$
$\log$:  $\xi_7 = \log\left(1 - \frac{1}{1 + e^{-W^T X}}\right)$
$\times\,y$:  $\xi_8 = y \log \frac{1}{1 + e^{-W^T X}}$;  $\times\,(1-y)$:  $\xi_9 = (1 - y)\log\left(1 - \frac{1}{1 + e^{-W^T X}}\right)$
$+$, $\times(-1)$:  $-\mathcal{L} = y \log \frac{1}{1 + e^{-W^T X}} + (1 - y)\log\left(1 - \frac{1}{1 + e^{-W^T X}}\right)$

SLIDE 70

Putting it altogether

  • 1. We specify the network structure.

[Diagram: the network with weights W1, W2 and W3]

  • 2. We create the computational graph.
  • 3. At each node of the graph we build two functions: the evaluation of the variable and its partial derivative with respect to the previous variable (as shown in the table three slides back).
  • 4. Now we can either go forward or backward depending on the situation. In general, forward is easier to implement and to understand. The difference is clearer when there are multiple nodes per layer.

SLIDE 71

Forward mode: Evaluate the derivative at: X={3}, y=1, W=3

Forward mode repeats the table from Slide 65, accumulating the derivative in the same order the variables are evaluated: starting from $\partial\xi_1/\partial W = -X = -3$, multiply by each local derivative while moving from the input toward the loss, giving

$$\frac{\partial\mathcal{L}_i^A}{\partial W} = -3\,\frac{e^{-9}}{1 + e^{-9}} \approx -0.00037018372$$

SLIDE 72

Backward mode: Evaluate the derivative at: X={3}, y=1, W=3

Backward mode first evaluates all the variables in a forward pass and stores all these values:

$$\xi_1 = -9, \quad \xi_2 = e^{-9}, \quad \xi_3 = 1 + e^{-9}, \quad \xi_4 = \frac{1}{1 + e^{-9}}, \quad \xi_5 = \log\frac{1}{1 + e^{-9}}$$

It then accumulates $\partial\mathcal{L}/\partial\xi$ from the loss backward, multiplying by one stored local derivative at a time: $\partial\mathcal{L}_i^A/\partial\xi_5 = -y = -1$, then $\partial\mathcal{L}_i^A/\partial\xi_4$, and so on down to $\partial\mathcal{L}_i^A/\partial W$.
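A sketch of the same backward accumulation, reusing the basic functions from the "Basic functions" slide: the forward values are stored, then the gradient is multiplied together from the loss back to the input.

X, y, W = 3.0, 1.0, 3.0
xi1 = x1(X, W); xi2 = x2(xi1); xi3 = x3(xi2); xi4 = x4(xi3)  # forward pass, stored
grad = derL(y)       # dL/dxi5 = -y
grad *= derx5(xi4)   # dL/dxi4
grad *= derx4(xi3)   # dL/dxi3
grad *= derx3(xi2)   # dL/dxi2
grad *= derx2(xi1)   # dL/dxi1
grad *= derx1(X, W)  # dL/dW ~ -0.00037018372
print(grad)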