Machine Learning and Data Mining
Multi-layer Perceptrons & Neural Networks: Basics
Kalev Kask
Slides (c) Alexander Ihler
Linear classifiers (perceptrons)
– A linear classifier is a mapping which partitions feature space using a linear function (a straight line, or a hyperplane)
– It separates the two classes using a straight line in feature space
– In 2 dimensions the decision boundary is a straight line

[Figure: two plots over Feature 1 (x1) and Feature 2 (x2) with a linear decision boundary; left: linearly separable data, right: linearly non-separable data]
[Diagram: inputs x1, x2 weighted by w1, w2 with bias w0; the weighted sum of the inputs passes through a threshold function T(r) to give a class decision in {-1, +1}]

Linear response: r = w1 x1 + w2 x2 + w0
r = X.dot( theta.T )    # compute linear response
Yhat = 2*(r > 0) - 1    # "sign": predict +1 / -1
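A runnable version of this prediction step; the weights theta and the two example points below are made up for illustration:

import numpy as np

# hypothetical weights [w0, w1, w2]: bias first, matching features [1, x1, x2]
theta = np.array([[-1.0, 2.0, 0.5]])

# two made-up data points, each with a constant-1 feature for the bias
X = np.array([[1.0, 0.2, 0.3],
              [1.0, 1.5, 0.4]])

r = X.dot(theta.T)      # linear response, one value per point
Yhat = 2*(r > 0) - 1    # threshold: predict +1 / -1
print(Yhat.ravel())     # -> [-1  1]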
Decision boundary = "x such that T(w1 x + w0) transitions"
1D example: T(r) = -1 if r < 0, T(r) = +1 if r > 0; the boundary is where the linear response r = w1 x + w0 crosses zero
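For instance, with made-up weights w1 = 2 and w0 = -1, the response r = 2x - 1 crosses zero at x = 1/2: T(r) = -1 for x < 1/2 and T(r) = +1 for x > 1/2, so the decision boundary is the single point x* = -w0/w1 = 1/2.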
– We can create extra features that allow more complex decision boundaries
– Linear classifier: features [1, x]
– Features [1, x, x^2]
– What features can produce this decision rule?
– We can create extra features that allow more complex decision boundaries
– For example, polynomial features (a construction sketch follows below)
– Step functions?
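A minimal sketch of building such polynomial features with numpy; the inputs and classifier weights are made up:

import numpy as np

x = np.array([0.0, 0.5, 1.0, 2.0])       # made-up 1-D inputs

Phi = np.column_stack([np.ones_like(x),  # constant feature 1
                       x,                # linear feature x
                       x**2])            # quadratic feature x^2

theta = np.array([-1.0, 0.0, 1.0])       # hypothetical weights
r = Phi.dot(theta)                       # response r = x^2 - 1
Yhat = 2*(r > 0) - 1                     # a linear classifier in Phi-space
print(Yhat)                              # = a quadratic rule in x: [-1 -1 -1  1]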
– "Features" F1, F2, F3 are the outputs of perceptrons
– A combination of the features is the output of another: a linear function a F1 + b F2 + c F3 + d, e.g. F1 - F2 + F3 (a numeric sketch follows below)

[Diagram: a "hidden layer" of three perceptrons with weights (w10, w11), (w20, w21), (w30, w31) produces F1, F2, F3; an "output layer" perceptron combines them with weights w1, w2, w3]

Weight matrices:
W1 = [ w10 w11
       w20 w21
       w30 w31 ]
W2 = [ w1 w2 w3 ]
Regression version: the same network, but with the activation function removed from the output layer.
– Each element is just a perceptron function

Perceptron: step function / linear partition of the input features
2-layer: "features" are now partitions; the output is all linear combinations of those partitions
3-layer: "features" are now complex functions; the output is any linear combination of those
Current research: "deep" architectures (many layers)

[Diagrams: networks stacking the input features through Layer 1, Layer 2, Layer 3]
– Each element is just a perceptron function
– Approximate arbitrary functions with enough hidden nodes

[Diagram: inputs x0, x1 => hidden nodes h1, h2, h3 => output y, with weights v0, v1]
["How stuff works: the brain" illustration]
Feed-forward computation:
– Input the observed features
– Compute the hidden nodes (in parallel)
– Compute the next layer…
R  = X.dot(W[0]) + B[0]    # linear response
H1 = Sig( R )              # activation f'n
S  = H1.dot(W[1]) + B[1]   # linear response
H2 = Sig( S )              # activation f'n
# ...

[Diagram: X => W[0] => H1 => W[1] => H2]
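A self-contained version of this forward pass; the layer sizes, random weights, and the logistic choice for Sig are assumptions for illustration:

import numpy as np

def Sig(r):                       # logistic sigmoid activation
    return 1.0 / (1.0 + np.exp(-r))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))       # 4 examples, 3 input features (made up)
W = [rng.normal(size=(3, 5)),     # input -> 5 hidden units
     rng.normal(size=(5, 2))]     # hidden -> 2 outputs
B = [np.zeros(5), np.zeros(2)]    # bias vectors

R  = X.dot(W[0]) + B[0]           # hidden linear response   (4 x 5)
H1 = Sig(R)                       # hidden activations
S  = H1.dot(W[1]) + B[1]          # output linear response   (4 x 2)
H2 = Sig(S)                       # outputs
print(H2.shape)                   # -> (4, 2)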
– Predict multi-dimensional y
– A "shared" representation = fewer parameters
– Predict a binary vector
– Multi-class classification: class y = 2 is encoded as [0 0 1 0 … ] (a one-hot sketch follows below)
– Multiple, joint binary predictions (image tagging, etc.)
– Often trained as regression (MSE), with a saturating activation
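A sketch of that one-hot encoding in numpy; the class labels are made up:

import numpy as np

y = np.array([0, 2, 1, 2])            # made-up class labels in {0, 1, 2}
Y = np.zeros((len(y), 3))             # one row per example, one column per class
Y[np.arange(len(y)), y] = 1           # set each example's class column to 1
print(Y)                              # e.g. y = 2 becomes the row [0. 0. 1.]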
– Logistic sigmoid activation function: smooth, differentiable (sketched below)
– Training: batch gradient descent or stochastic gradient descent
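A sketch of the logistic sigmoid and its derivative; the names Sig and dSig match the gradient pseudocode used later:

import numpy as np

def Sig(r):                 # logistic sigmoid: smooth and differentiable
    return 1.0 / (1.0 + np.exp(-r))

def dSig(r):                # its derivative: Sig(r) * (1 - Sig(r))
    s = Sig(r)
    return s * (1.0 - s)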
(Can use different loss functions if desired…)
– Building blocks: summations & nonlinearities
– For derivatives, just apply the chain rule, etc.!
Ex: f(g, h) = g^2 h; save & reuse the values (g, h) from the forward computation!
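Working that example out: for f(g, h) = g^2 h, the chain-rule pieces are df/dg = 2 g h and df/dh = g^2. Both derivatives reuse the values g and h already computed on the forward pass, which is exactly why backpropagation saves them.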
Forward pass
Output layer, hidden layer, and loss function
(Identical to logistic MSE regression with inputs "hj")
[Diagram: network with inputs, hidden units hj, and outputs ŷk]
# Forward pass (bias terms omitted here for clarity)
# X : (1 x N1), W[0] : (N2 x N1), H : (1 x N2), W[1] : (N3 x N2), Yhat : (1 x N3)
R    = X.dot( W[0].T )         # hidden linear response
H    = Sig( R )                # hidden activations
S    = H.dot( W[1].T )         # output linear response
Yhat = Sig( S )                # predictions

# Backward pass (MSE loss; B's are layer "deltas", G's are gradient blocks)
B2 = (Y - Yhat) * dSig(S)      # (1 x N3)
G2 = B2.T.dot( H )             # (N3 x 1)*(1 x N2) = (N3 x N2), for W[1]
B1 = B2.dot( W[1] ) * dSig(R)  # (1 x N3).(N3 x N2)*(1 x N2) = (1 x N2)
G1 = B1.T.dot( X )             # (N2 x N1), for W[0]
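A runnable version of repeated gradient steps under the same MSE setup; the data, layer sizes, and step size are made up, and biases are omitted as in the code above:

import numpy as np

def Sig(r):                            # logistic sigmoid
    return 1.0 / (1.0 + np.exp(-r))

def dSig(r):                           # its derivative
    s = Sig(r)
    return s * (1.0 - s)

rng = np.random.default_rng(1)
X = rng.normal(size=(1, 4))            # one example, N1 = 4 features
Y = np.array([[0.0, 1.0]])             # its target, N3 = 2 outputs
W = [0.1 * rng.normal(size=(6, 4)),    # N2 = 6 hidden units
     0.1 * rng.normal(size=(2, 6))]

for it in range(100):                  # repeated steps of size 0.5
    R = X.dot(W[0].T); H = Sig(R)      # forward pass
    S = H.dot(W[1].T); Yhat = Sig(S)
    B2 = (Y - Yhat) * dSig(S)          # backward pass, as above
    B1 = B2.dot(W[1]) * dSig(R)
    W[1] += 0.5 * B2.T.dot(H)          # the G's point along the negative
    W[0] += 0.5 * B1.T.dot(X)          # MSE gradient, so += descends
print(Yhat)                            # moves toward [[0. 1.]]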
Example: regression
– 1 input feature => 1 input unit
– 10 hidden units
– 1 target => 1 output unit
– Logistic sigmoid activation for the hidden layer, linear for the output layer
[Plot: data and the learned prediction function; the responses of the hidden nodes (= the features of the linear regression) select out useful regions of "x"]
Example: classification
– 2 input features => 2 input units
– 10 hidden units
– 3 classes => 3 output units (y = [0 0 1], etc.)
– Logistic sigmoid activation functions
– Optimize the MSE of the predictions using stochastic gradient descent
Dropout:
– Randomly "block" some neurons at each step
– Trains the model to have redundancy (predictions must be robust to blocking)
[Diagram: the full network next to a copy with sampled neurons removed (inputs, hidden layers, output); each training prediction samples neurons to remove]
[Srivastava et al 2014]
# ... during training ...
R  = X.dot(W[0]) + B[0]                  # linear response
H1 = Sig( R )                            # activation f'n
H1 *= np.random.rand(*H1.shape) < p      # drop out! keep each unit with prob. p
# ...
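At test time the full network is used and, in the Srivastava et al. formulation, the weights are scaled by p to match the training-time expectation. A common equivalent variant, "inverted dropout", rescales during training instead; a sketch continuing the snippet above (X, W, B, p as before):

H1 = Sig(X.dot(W[0]) + B[0])             # activations, as before
mask = np.random.rand(*H1.shape) < p     # keep each unit with probability p
H1 = H1 * mask / p                       # rescale now, so test time needs no change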
Feed-forward networks:
– Fixed-length input/output
– Feed forward (no cycles)
– E.g. image recognition

Recurrent networks:
– Variable-length input
– Feedback connections
– Dynamic temporal behavior
– E.g. speech/text processing
– Handwriting recognition (online demo)
– Network: 784 pixels => 500 mid => 500 high => 2000 top => 10 labels
[Hinton et al. 2007]
Convolution
Input: 28x28 image; weights: 5x5
Compute the filter response at each patch; run the filter over all patches of the input => an activation map (a 24x24 image)
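A minimal numpy sketch of this "valid" convolution; the image and filter below are random placeholders:

import numpy as np

rng = np.random.default_rng(0)
img = rng.normal(size=(28, 28))        # input image (made up)
w   = rng.normal(size=(5, 5))          # 5x5 filter weights (made up)

out = np.zeros((24, 24))               # 28 - 5 + 1 = 24 positions per axis
for i in range(24):
    for j in range(24):
        patch = img[i:i+5, j:j+5]      # one 5x5 patch of the input
        out[i, j] = np.sum(patch * w)  # filter response at that patch
print(out.shape)                       # -> (24, 24) activation map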
Another 5x5 filter, run over all patches of the input => another activation map
[Diagram: hidden layer 1 = a stack of activation maps, one per 5x5 filter]
– Convolutional layers
– "Max-pooling" (sub-sampling) layers; a pooling sketch follows below
– Densely connected layers
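A sketch of 2x2 max-pooling, one common sub-sampling choice; the 4x4 input and the pool size are assumptions:

import numpy as np

A = np.arange(16.0).reshape(4, 4)      # a made-up 4x4 activation map

# max over each non-overlapping 2x2 block -> a 2x2 sub-sampled map
P = A.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(P)                               # -> [[ 5.  7.] [13. 15.]]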
LeNet-5 [LeCun et al. 1998]
AlexNet [Krizhevsky et al. 2012]: input 224x224x3 => convolutional layers (5) => dense layers (3) => output (1000 classes)
Slide image from Yann LeCun: https://drive.google.com/open?id=0BxKBnD5y2M8NclFWSXNxa0JlZTg
[Zeiler & Fergus 2013]
Software:
https://github.com/rasmusbergpalm/DeepLearnToolbox
https://github.com/lisa-lab/pylearn2
Summary
– Each unit is just a linear classifier
– Hidden units are used to create new features
– Enough hidden units (features) can approximate any function
– Can create nonlinear classifiers
– Also used for function approximation, regression, …
– Training: gradient descent; logistic sigmoid; apply the chain rule (the "building block" view)