Advanced Machine Learning: Dense Neural Networks
Amit Sethi, Electrical Engineering, IIT Bombay
Learning objectives
- Learn the motivations behind neural networks
- Become familiar with neural network terms
- Understand the working of neural networks
- Understand behind-the-scenes training of neural networks
Neural networks are inspired by the mammalian brain
- Each unit (neuron) is simple
- But the human brain has 100 billion neurons with 100 trillion connections
- The strength and nature of the connections store memories and the "program" that makes us human
- A neural network is a web of artificial neurons
Artificial neurons are inspired by biological neurons
- Neural networks are made up of artificial neurons
- Artificial neurons are only loosely based on real neurons, just like neural networks are only loosely based on the human brain
[Diagram: a neuron computes g(w1·x1 + w2·x2 + w3·x3 + b), where g is the activation function]
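The neuron in the diagram can be sketched in a few lines; a minimal sketch assuming NumPy, with a sigmoid activation for g and made-up input, weight, and bias values:

```python
import numpy as np

def neuron(x, w, b, g):
    """One artificial neuron: weighted sum of inputs plus bias, then activation g."""
    return g(np.dot(w, x) + b)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0])    # inputs x1, x2, x3 (made up)
w = np.array([0.5, -0.25, 0.1])  # weights w1, w2, w3 (made up)
b = 0.2                          # bias (made up)
y = neuron(x, w, b, sigmoid)     # output is squashed into (0, 1) by the sigmoid
```

Training, as the following slides explain, is the process of adjusting `w` and `b`.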
Activation function is the secret sauce of neural networks
- Neural network training is all about tuning weights and biases
- If there were no activation function g, the output of the entire neural network would be a linear function of the inputs
- The earliest models used a step function
[Diagram: a neuron computes g(w1·x1 + w2·x2 + w3·x3 + b)]
Types of activation functions
- Step: original concept behind classification and region bifurcation. Not used anymore
- Sigmoid and tanh: trainable approximations of the step function
- ReLU: currently preferred due to fast convergence
- Softmax: currently preferred for the output of a classification net. Generalized sigmoid
- Linear: good for modeling a range in the output of a regression net
Formulas for activation functions
- Step: y = (sign(x) + 1) / 2
- Sigmoid: y = 1 / (1 + e^(−x))
- Tanh: y = tanh(x)
- ReLU: y = max(0, x)
- Softmax: y_j = e^(x_j) / Σ_k e^(x_k)
- Linear: y = x
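The formulas above translate directly into code; a minimal sketch assuming NumPy (function names are my own, and softmax subtracts the max for numerical stability, a standard trick not shown in the formula):

```python
import numpy as np

def step(x):
    return (np.sign(x) + 1) / 2          # 0 for x < 0, 1 for x > 0 (0.5 at x = 0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))          # smooth approximation of the step

def relu(x):
    return np.maximum(0, x)              # zero for negatives, identity for positives

def softmax(x):
    e = np.exp(x - np.max(x))            # shift by max for numerical stability
    return e / e.sum()                   # outputs are positive and sum to 1

x = np.array([-2.0, 0.0, 2.0])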
Step function divides the input space into two halves: 0 and 1
- In a single neuron, the step function acts as a linear binary classifier
- The weights and biases determine where the step will be in n dimensions
- But, as we shall see later, it gives little information about how to change the weights if we make a mistake
- So, we need a smoother version of the step function
- Enter: the sigmoid function
The sigmoid function is a smoother step function
- Smoothness ensures that there is more information about the direction in which to change the weights if there are errors
- The sigmoid function is also mathematically linked to logistic regression, which is a theoretically well-backed linear classifier
The problem with sigmoid is (near) zero gradient at both extremes
- For both large positive and large negative input values, the sigmoid doesn't change much with a change of input
- ReLU has a constant gradient for almost half of the inputs
- But ReLU cannot give a meaningful final output
Output activation functions can only be of the following kinds
- Sigmoid gives binary classification output
- Tanh can also do that, provided the desired output is in {−1, +1}
- Softmax generalizes sigmoid to n-ary classification
- Linear is used for regression
- ReLU is only used in internal (non-output) nodes
Contents
- Introduction to neural networks
- Feed forward neural networks
- Gradient descent and backpropagation
- Learning rate setting and tuning
Basic structure of a neural network
- It is feed forward
– Connections go from inputs towards outputs
– No connection comes backwards
- It consists of layers
– Current layer's input is previous layer's output
– No lateral (intra-layer) connections
- That's it!
[Diagram: feed-forward network with inputs x1 … xd, hidden units h11 … h1n, and outputs y1 … yn]
Basic structure of a neural network
- Output layer
– Represents the output of the neural network
– For a two-class problem or regression with a 1-d output, we need only one output node
- Hidden layer(s)
– Represent the intermediary nodes that divide the input space into regions with (soft) boundaries
– These usually form a hidden layer
– Usually, there is only one such layer
– Given enough hidden nodes, we can model an arbitrary input-output relation
- Input layer
– Represents the dimensions of the input vector (one node for each dimension)
– These usually form an input layer
– Usually, there is only one such layer
Importance of hidden layers
- First hidden layer extracts features
- Second hidden layer extracts features of features
- …
- Output layer gives the desired output
[Figure: two-class 2-d data — a single sigmoid gives only a linear boundary; sigmoid hidden layers with a sigmoid output give a nonlinear boundary]
Overall function of a neural network
- f(x) = f_m(W_m · f_{m−1}(W_{m−1} · … f_1(W_1 · x) … ))
- Weights form a matrix
- Outputs of the previous layer form a vector
- The activation (nonlinear) function is applied point-wise to the weight matrix times the input
- Design questions (hyper-parameters):
– Number of layers
– Number of neurons in each layer (rows of weight matrices)
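The nested-function form above is just a loop of matrix multiplications followed by point-wise activations; a minimal sketch assuming NumPy, with made-up layer sizes and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, weights, activations):
    """Apply each layer: activation of (weight matrix times previous output)."""
    a = x
    for W, f in zip(weights, activations):
        a = f(W @ a)
    return a

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Made-up design: 4-d input -> 5 hidden ReLU units -> 3 softmax outputs
weights = [rng.standard_normal((5, 4)), rng.standard_normal((3, 5))]
y = forward(rng.standard_normal(4), weights, [relu, softmax])
```

The rows of each weight matrix correspond to the neurons of that layer, matching the design questions above.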
Training the neural network
- Given inputs x_i and targets y_i
- Think of what hyper-parameters and neural network design might work
- Form a neural network: f(x) = f_m(W_m · f_{m−1}(W_{m−1} · … f_1(W_1 · x) … ))
- Compute f_w(x_i) as an estimate of y_i for all samples
- Compute loss: (1/N) Σ_{i=1}^{N} L(f_w(x_i), y_i) = (1/N) Σ_{i=1}^{N} l_i(w)
- Tweak the weights w to reduce the loss (optimization algorithm)
- Repeat the last three steps
Loss function choice
- In regression, errors can be positive or negative, and MSE (mean squared error) is the most common loss function
- In classification, the output is a probability of the correct class, for which cross entropy is the most common loss function
Some loss functions and their derivatives
- Terminology
– ŷ is the network output
– y is the target output
- Mean square error
- Loss: (ŷ − y)²
- Derivative of the loss: 2(ŷ − y)
- Cross entropy
- Loss: −Σ_{d=1}^{D} y_d log ŷ_d
- Derivative of the loss: −1/ŷ_d, evaluated at the target class d
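Both losses and their derivatives are one-liners; a minimal sketch assuming NumPy and a one-hot target vector (the example values are made up):

```python
import numpy as np

def mse_loss(y_hat, y):
    return (y_hat - y) ** 2

def mse_grad(y_hat, y):
    return 2 * (y_hat - y)

def cross_entropy(y_hat, y):
    """y_hat: predicted probabilities, y: one-hot target."""
    return -np.sum(y * np.log(y_hat))

def cross_entropy_grad(y_hat, y):
    return -y / y_hat            # nonzero only at the target class

y_hat = np.array([0.7, 0.2, 0.1])  # made-up softmax output
y = np.array([1.0, 0.0, 0.0])      # one-hot target
```

With a one-hot target the cross-entropy loss reduces to −log of the probability assigned to the correct class.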
Computational graph of a single hidden layer NN
[Graph: x → (W1·x + b1 = Z1) → ReLU → A1 → (W2·A1 + b2 = Z2) → Softmax → A2 → cross-entropy loss against the target]
Advanced Machine Learning Backpropagation
Amit Sethi Electrical Engineering, IIT Bombay
Learning objectives
- Write derivative of a nested function using
chain rule
- Articulate how storage of partial derivatives
leads to an efficient gradient descent for neural networks
- Write gradient descent as matrix operations
Overall function of a neural network
- f(x) = f_m(W_m · f_{m−1}(W_{m−1} · … f_1(W_1 · x) … ))
- Weights form a matrix
- Outputs of the previous layer form a vector
- The activation (nonlinear) function is applied point-wise to the weight matrix times the input
- Design questions (hyper-parameters):
– Number of layers
– Number of neurons in each layer (rows of weight matrices)
Training the neural network
- Given inputs x_i and targets y_i
- Think of what hyper-parameters and neural network design might work
- Form a neural network: f(x) = f_m(W_m · f_{m−1}(W_{m−1} · … f_1(W_1 · x) … ))
- Compute f_w(x_i) as an estimate of y_i for all samples
- Compute loss: (1/N) Σ_{i=1}^{N} L(f_w(x_i), y_i) = (1/N) Σ_{i=1}^{N} l_i(w)
- Tweak the weights w to reduce the loss (optimization algorithm)
- Repeat the last three steps
Gradient ascent
- If you didn’t know the shape of a mountain
- But at every step you knew the slope
- Can you reach the top of the mountain?
Gradient descent minimizes the loss function
- At every point, compute
- Loss (scalar): l_i(w)
- Gradient of the loss with respect to the weights (vector): ∇_w l_i(w)
- Take a step towards the negative gradient:
w ← w − η ∇_w (1/N) Σ_{i=1}^{N} l_i(w)
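The update rule above is a short loop; a minimal sketch assuming NumPy, minimizing a simple quadratic whose gradient we know in closed form (the learning rate and step count are made up):

```python
import numpy as np

def gradient_descent(grad, w0, lr=0.05, steps=200):
    """Repeat w <- w - lr * gradient until the step budget runs out."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Minimize f(w) = 5*w1^2 + 3*w2^2, whose gradient is [10*w1, 6*w2]
grad = lambda w: np.array([10 * w[0], 6 * w[1]])
w_star = gradient_descent(grad, [2.0, 1.0])
```

For a neural network, `grad` would be the average of the per-sample gradients ∇_w l_i(w), computed by backpropagation.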
Derivative of a function of a scalar
- Derivative g′(x) = dg(x)/dx is the rate of change of g(x) with x
- It is zero where the function is flat (horizontal), such as at a minimum or maximum of g(x)
- It is positive when g(x) is sloping up, and negative when g(x) is sloping down
- To move towards a maximum, take a small step in the direction of the derivative
E.g. g(x) = bx² + cx + d, g′(x) = 2bx + c, g′′(x) = 2b
Gradient of a function of a vector
- Derivative with respect to each dimension, holding the other dimensions constant
- ∇g(x) = ∇g(x1, x2) = [∂g/∂x1, ∂g/∂x2]
- At a minimum or a maximum the gradient is a zero vector: the function is flat in every direction
[Surface plot of f(x1, x2)]
Gradient of a function of a vector
- The gradient gives a direction for moving towards the minimum
- Take a small step towards the negative of the gradient
Example of gradient
- Let g(x) = g(x1, x2) = 5x1² + 3x2²
- Then ∇g(x) = ∇g(x1, x2) = [∂g/∂x1, ∂g/∂x2] = [10x1, 6x2]
- At the location (2, 1), a step in the [20, 6] direction (unit vector [0.958, 0.287]) will lead to the maximal increase in the function
This story is unfolding in multiple dimensions
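The worked example above can be verified numerically; a sketch assuming NumPy, using a central finite-difference check against the analytic gradient:

```python
import numpy as np

g    = lambda x: 5 * x[0] ** 2 + 3 * x[1] ** 2
grad = lambda x: np.array([10 * x[0], 6 * x[1]])   # analytic gradient

x0 = np.array([2.0, 1.0])
gv = grad(x0)                       # should be [20, 6]
unit = gv / np.linalg.norm(gv)      # direction of maximal increase

# Central finite-difference approximation of each partial derivative
eps = 1e-6
num = np.array([(g(x0 + eps * e) - g(x0 - eps * e)) / (2 * eps)
                for e in np.eye(2)])
```

The same finite-difference trick is a standard sanity check for hand-derived gradients in backpropagation code.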
Backpropagation
- Backpropagation is an efficient method to do gradient descent
- It saves the gradient w.r.t. the upper layer's output to compute the gradient w.r.t. the weights immediately below
- It is linked to the chain rule of derivatives
- All intermediary functions must be differentiable, including the activation functions
Chain rule of differentiation
- Very handy for complicated functions
- Especially functions of functions
- E.g. NN outputs are functions of previous layers
- For example: let f(x) = g(h(x))
- Let u = h(x), so F = f(x) = g(u)
- Then f′(x) = dF/dx = (dF/du)(du/dx) = g′(u) h′(x)
- For example:
d sin(x²)/dx = 2x cos(x²)
Backpropagation makes use of the chain rule of derivatives
[Graph: x → (W1·x + b1 = Z1) → ReLU → A1 → (W2·A1 + b2 = Z2) → Softmax → A2 → cross-entropy loss against the target]
- Chain rule: ∂g(h(x))/∂x = (∂g(h)/∂h) · (∂h(x)/∂x)
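Applying the chain rule along this graph, back to front, gives backpropagation for the one-hidden-layer network; a minimal NumPy sketch with made-up sizes and random weights (dZ2 = A2 − t is the standard combined gradient of softmax followed by cross-entropy):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Made-up instance of the graph: 4-d input, 5 hidden units, 3 classes
x = rng.standard_normal(4)
t = np.array([0.0, 1.0, 0.0])            # one-hot target
W1, b1 = rng.standard_normal((5, 4)), np.zeros(5)
W2, b2 = rng.standard_normal((3, 5)), np.zeros(3)

# Forward pass, saving every intermediate value
Z1 = W1 @ x + b1
A1 = np.maximum(0, Z1)                   # ReLU
Z2 = W2 @ A1 + b2
A2 = softmax(Z2)
loss = -np.sum(t * np.log(A2))           # cross-entropy

# Backward pass: each gradient reuses the one computed just above it
dZ2 = A2 - t                             # softmax + cross-entropy combine to this
dW2 = np.outer(dZ2, A1)
db2 = dZ2
dA1 = W2.T @ dZ2                         # gradient handed down to the hidden layer
dZ1 = dA1 * (Z1 > 0)                     # ReLU passes gradient only where Z1 > 0
dW1 = np.outer(dZ1, x)
db1 = dZ1
```

Saving A1 and Z1 during the forward pass is exactly the "storage of partial derivatives" that makes backpropagation efficient.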
Vector valued functions and Jacobians
- We often deal with functions that give multiple outputs
- Let g(x) = [g1(x); g2(x)] = [g1(x1, x2, x3); g2(x1, x2, x3)]
- Thinking in terms of vectors of functions can make the representation less cumbersome and computations more efficient
- Then the Jacobian is
J(g) = [∂g/∂x1  ∂g/∂x2  ∂g/∂x3]
     = [∂g1/∂x1  ∂g1/∂x2  ∂g1/∂x3]
       [∂g2/∂x1  ∂g2/∂x2  ∂g2/∂x3]
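A Jacobian can be approximated column by column with finite differences; a sketch assuming NumPy, with a made-up function g: R³ → R² whose Jacobian is easy to compute by hand:

```python
import numpy as np

# Hypothetical vector-valued function g: R^3 -> R^2
g = lambda x: np.array([x[0] * x[1], x[1] + x[2] ** 2])

def jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian: entry (i, j) holds df_i / dx_j."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)   # central difference
    return J

x0 = np.array([1.0, 2.0, 3.0])
J = jacobian(g, x0)   # analytic answer: [[x2, x1, 0], [0, 1, 2*x3]] at x0
```

Each layer of a network has such a Jacobian; backpropagation multiplies them without ever forming them explicitly.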
Jacobian of each layer
- Compute the derivatives of a higher layer's output with respect to those of the lower layer
- What if we scale all the weights by a factor R?
- What happens a few layers down?
Role of step size and learning rate
- Tale of two loss functions
– Same value, and
– Same gradient (first derivative), but
– Different Hessian (second derivative)
– Different step sizes needed
- Success not guaranteed
The perfect step size is impossible to guess
- Goldilocks finds the perfect balance only in a fairy tale
- The step size is decided by the learning rate η and the gradient
Double derivative
- Double derivative g′′(x) = d²g(x)/dx² is the derivative of the derivative of g(x)
- The double derivative is positive for convex functions (which have a single minimum), and negative for concave functions (which have a single maximum)
E.g. g(x) = bx² + cx + d, g′(x) = 2bx + c, g′′(x) = 2b
Double derivative
- The double derivative tells how far the minimum might be from a given point
- From x = 0 the minimum is closer for the red dashed curve than for the blue solid curve, because the former has a larger second derivative (its slope reverses faster)
g(x) = bx² + cx + d, g′(x) = 2bx + c, g′′(x) = 2b
Perfect step size for a paraboloid
- Let g(x) = bx² + cx + d
- Assuming b > 0
- The minimum is at: x* = −c/(2b)
- For any x the perfect step would be:
−c/(2b) − x = −(2bx + c)/(2b) = −g′(x)/g′′(x)
- So, the perfect learning rate is: η* = 1/g′′(x)
- In multiple dimensions, x ← x − H(g(x))⁻¹ ∇g(x)
- Practically, we do not want to compute the inverse of a Hessian matrix, so we approximate the Hessian inverse
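The perfect-step claim can be checked on a concrete parabola (the coefficients are made up); one step of size g′(x)/g′′(x) should land exactly on the minimum:

```python
# Parabola g(x) = b*x^2 + c*x + d with b > 0 (made-up coefficients)
b, c, d = 2.0, -8.0, 1.0

def g1(x):
    return 2 * b * x + c        # g'(x)

g2 = 2 * b                      # g''(x), constant for a parabola

x = 5.0                         # made-up starting point
x_new = x - g1(x) / g2          # one step with the "perfect" learning rate 1/g''
x_star = -c / (2 * b)           # analytic minimum
```

For non-quadratic functions g′′ varies with x, so this Newton-style step is only locally perfect.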
Hessian of a function of a vector
- The double derivative with respect to each pair of dimensions forms the Hessian matrix: H_ij = ∂²g/∂x_i∂x_j
- If all eigenvalues of the Hessian matrix are positive, then the function is convex
Example of Hessian
- Let g(x) = g(x1, x2) = 5x1² + 3x2² + 4x1x2
- Then ∇g(x) = ∇g(x1, x2) = [∂g/∂x1, ∂g/∂x2] = [10x1 + 4x2, 6x2 + 4x1]
- And H(g(x)) = [∂²g/∂x1²    ∂²g/∂x1∂x2]
               [∂²g/∂x2∂x1  ∂²g/∂x2²  ]
             = [10  4]
               [ 4  6]
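The eigenvalue test from the previous slide can be applied to this example Hessian; a sketch assuming NumPy:

```python
import numpy as np

# Hessian of g(x1, x2) = 5*x1^2 + 3*x2^2 + 4*x1*x2 (constant, since g is quadratic)
H = np.array([[10.0, 4.0],
              [4.0, 6.0]])

eigenvalues = np.linalg.eigvalsh(H)     # symmetric matrix -> real eigenvalues
is_convex = bool(np.all(eigenvalues > 0))
```

Both eigenvalues (8 ± √20) are positive, so this g is convex and has a single minimum.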
Saddle points, Hessian and long local furrows
- Some variables may have reached a local minimum while others have not
- Some weights may have almost zero gradient
- At a saddle point, at least some eigenvalues of the Hessian are negative while others are positive
Complicated loss functions
A realistic picture
Image source: https://www.cs.umd.edu/~tomg/projects/landscapes/
[Figure: loss landscape annotated with a saddle point, local minima, a possible global minimum, and local maxima]