
Deep Learning: Introduction to Neural Networks. Hamid Beigy, Sharif University of Technology (PowerPoint PPT presentation)



  1. Deep learning: Introduction to neural networks. Hamid Beigy, Sharif University of Technology, September 30, 2019.

  2. Table of contents.

  3. Brain: table of contents.

  4. Brain.

  5. Functions of different parts of the brain. [figure: labeled brain diagram with 12 numbered regions]

  6. Brain network.

  7. Neuron. [figure: dendrites, cell body (soma), nucleus, axon, axonal arborization, synapses, and an axon from another cell]

  8. History of neural networks: table of contents.

  9. McCulloch and Pitts network (1943)
     1. The first model of a neuron was proposed by McCulloch (a physiologist) and Pitts (a logician).
     2. Inputs are binary.
     3. The neuron has two types of inputs: excitatory inputs (denoted a) and inhibitory inputs (denoted b).
     4. The output is binary: it fires (1) or does not fire (0).
     5. The output remains zero until the sum of the inputs reaches a threshold.

  10. McCulloch and Pitts network (logic functions). [figure: MP units realizing AND, OR, and NOT using excitatory inputs a_1..a_n, inhibitory inputs b_1..b_m, and a threshold θ] A sketch of such a unit follows below.
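The McCulloch and Pitts unit of slides 9 and 10 can be written down directly. Below is a minimal Python sketch, assuming the standard formulation in which any active inhibitory input forces the output to 0 and the unit otherwise fires when the sum of its binary excitatory inputs reaches the threshold; the function names and threshold choices are illustrative, not taken from the slides.

```python
def mcculloch_pitts(excitatory, inhibitory, threshold):
    """McCulloch-Pitts unit: binary inputs, binary output.

    Any active inhibitory input forces the output to 0; otherwise the
    unit fires (1) when the sum of the excitatory inputs reaches the
    threshold, and stays at 0 below it.
    """
    if any(inhibitory):
        return 0
    return 1 if sum(excitatory) >= threshold else 0

# The logic functions of slide 10 as special cases (usual threshold choices):
AND = lambda a1, a2: mcculloch_pitts([a1, a2], [], threshold=2)
OR  = lambda a1, a2: mcculloch_pitts([a1, a2], [], threshold=1)
NOT = lambda b1:     mcculloch_pitts([1], [b1], threshold=1)  # constant excitatory input, b1 inhibitory

print(AND(1, 1), OR(0, 1), NOT(1))  # -> 1 1 0
```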

  11. Perceptron (Frank Rosenblatt, 1958)
      1. Problems with McCulloch and Pitts neurons:
         - Weights and thresholds are determined analytically (they cannot be learned).
         - It is very difficult to minimize the size of a network.
         - What about non-discrete and/or non-binary tasks?
      2. Perceptron solution:
         - Weights and thresholds can be determined analytically or by a learning algorithm.
         - Continuous, bipolar, and multiple-valued versions exist.
         - Rosenblatt randomly connected the perceptrons and changed the weights in order to achieve learning.
         - Efficient minimization heuristics exist.

  12. Perceptron (Frank Rosenblatt, 1958)
      Simplified mathematical model: the inputs combine linearly, and threshold logic decides the output (fire if the combined input exceeds the threshold).
      1. Let y be the correct output and f(x) the output function of the network. The perceptron updates the weights as (Rosenblatt, 1960)
         w_j^(t+1) ← w_j^(t) + α x_j (y − f(x)).
         A sketch of this rule follows below.
      2. The McCulloch and Pitts neuron is a better model of the electrochemical process inside the neuron than the perceptron.
      3. But the perceptron is the basis and building block of modern neural networks.
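A small sketch of the Rosenblatt update on slide 12, assuming a 0/1 threshold activation and a bias folded in as a constant input x_0 = 1; the data and names are illustrative.

```python
import numpy as np

def perceptron_train(X, y, alpha=0.1, epochs=20):
    """Perceptron learning: w_j <- w_j + alpha * x_j * (y - f(x)).

    X carries a leading column of ones so that w[0] acts as the bias;
    targets y are in {0, 1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            f_x = 1 if np.dot(w, x_i) >= 0 else 0  # threshold activation
            w += alpha * (y_i - f_x) * x_i         # changes w only on mistakes
    return w

# Toy linearly separable task (the AND function), bias column prepended.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
y = np.array([0, 0, 0, 1])
w = perceptron_train(X, y)
```

Because y − f(x) is zero whenever the prediction is already correct, the weights only move on misclassified examples.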

  13. Adaline (Bernard Widrow and Ted Hoff, 1960)
      1. The model is the same as the perceptron, but it uses a different learning algorithm.
      2. A multilayer network of Adaline units is known as a Madaline.

  14. Adaline learning (Bernard Widrow and Ted Hoff, 1960)
      1. Let y be the correct output and f(x) = Σ_{j=0}^{n} w_j x_j. Adaline updates the weights as
         w_j^(t+1) ← w_j^(t) + α x_j (y − f(x)).
         A sketch of this rule follows below.
      2. Adaline converges to the least-squares error (y − f(x))^2. This update rule is in fact the stochastic gradient descent update for linear regression.
      3. In the 1960s, there were many articles promising robots that could think.
      4. There seems to have been a general belief that the perceptron could solve almost any problem.
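A sketch of the Adaline (LMS) rule on slide 14, assuming f(x) = Σ_j w_j x_j with a constant input x_0 = 1; the names are illustrative. Unlike the perceptron, the update uses the linear output itself, so it is exactly stochastic gradient descent on the squared error (y − f(x))^2.

```python
import numpy as np

def adaline_train(X, y, alpha=0.01, epochs=50):
    """Adaline / LMS rule: w_j <- w_j + alpha * x_j * (y - f(x)),
    with the *linear* output f(x) = w . x (no threshold), i.e. stochastic
    gradient descent on the squared error (y - f(x))^2."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            f_x = np.dot(w, x_i)
            w += alpha * (y_i - f_x) * x_i
    return w
```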

  15. Minsky and Papert (1968)
      1. Minsky and Papert published their book Perceptrons. The book shows that perceptrons can only solve linearly separable problems: there is no solution for XOR, so the perceptron is not universal.
      2. They showed that it is not possible for a perceptron to learn the XOR function. [figure: XOR points in the X-Y plane that no single line can separate]
      3. After Perceptrons was published, researchers lost interest in perceptrons and neural networks.

  16. Multi-layer perceptron (Minsky and Papert, 1968)
      XOR can be computed by a multi-layer perceptron: the first layer is a "hidden" layer. This construction was also originally suggested by Minsky and Papert (1968). [figure: two-layer network of threshold units over inputs X and Y computing XOR] A sketch of such a network follows below.
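One way to make slide 16 concrete: a two-layer network of threshold units that computes XOR. The exact weights in the slide's figure are not fully recoverable from the text, so the construction below (h1 = OR, h2 = AND, output = h1 AND NOT h2) is a standard illustrative choice rather than the slide's own numbers.

```python
def step(z, threshold):
    """Threshold unit: fires (1) if the weighted input reaches the threshold."""
    return 1 if z >= threshold else 0

def xor(x, y):
    """XOR via one hidden layer of threshold units (illustrative weights)."""
    h1 = step(x + y, 1)      # hidden unit 1: OR(x, y)
    h2 = step(x + y, 2)      # hidden unit 2: AND(x, y)
    return step(h1 - h2, 1)  # output unit: fires iff h1 fires and h2 does not

print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # -> [0, 1, 1, 0]
```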

  17. History
      1. Optimization:
         1. In 1969, Bryson and Ho proposed backpropagation as a multi-stage dynamic system optimization method.
         2. In 1972, Stephen Grossberg proposed networks capable of learning the XOR function.
         3. In 1974, Paul Werbos, and later David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams, reinvented backpropagation and applied it in the context of neural networks. Backpropagation allowed perceptrons to be trained in a multilayer configuration.
      2. In the 1980s, the field of artificial neural network research experienced a resurgence.
      3. In the 2000s, neural networks fell out of favor, partly due to the limitations of backpropagation.
      4. Since around 2010, we have been able to train much larger networks using huge modern computing power such as GPUs.

  18. History.

  19. Gradient-based learning: table of contents.

  20. Cost function
      1. The goal of machine learning algorithms is to construct a model (hypothesis) that can be used to estimate y based on x.
      2. Let the model be of the form h(x) = w_0 + w_1 x.
      3. The goal of creating a model is to choose the parameters so that h(x) is close to y for the training data (x, y).
      4. We need a function to minimize over our dataset. A function that is often used is the mean squared error,
         J(w) = (1 / 2m) Σ_{i=1}^{m} (h(x_i) − y_i)^2,
         sketched below.
      5. How do we find the minimum value of the cost function?
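A minimal sketch of the cost on slide 20 for the linear hypothesis h(x) = w_0 + w_1 x, following the 1/(2m) convention above; the variable names are illustrative.

```python
import numpy as np

def cost(w, x, y):
    """Mean squared error J(w) = (1 / 2m) * sum_i (h(x_i) - y_i)^2
    for the linear hypothesis h(x) = w[0] + w[1] * x."""
    m = len(y)
    h = w[0] + w[1] * x  # predictions for all training points
    return np.sum((h - y) ** 2) / (2 * m)
```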

  21. Gradient descent
      1. Gradient descent is by far the most popular optimization strategy used in machine learning and deep learning at the moment.
      2. The cost (error) is a function of the weights (parameters).
      3. We want to reduce/minimize the error.
      4. Gradient descent: move towards the error minimum.
      5. Compute the gradient, which gives the direction towards the error minimum.
      6. Adjust the weights in the direction of lower error.

  22. Gradient descent.

  23. Gradient descent (linear regression)
      1. We have the following hypothesis and we need to fit it to the training data: h(x) = w_0 + w_1 x.
      2. We use a cost function such as the mean squared error,
         J(w) = (1 / 2m) Σ_{i=1}^{m} (h(x_i) − y_i)^2.
      3. This cost function can be minimized using gradient descent (see the sketch below):
         w_0^(t+1) = w_0^(t) − α ∂J(w^(t)) / ∂w_0
         w_1^(t+1) = w_1^(t) − α ∂J(w^(t)) / ∂w_1
         where α is the step size (learning rate).
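A sketch of batch gradient descent for the linear-regression cost on slide 23; the gradient expressions follow from differentiating the 1/(2m) squared-error cost, and the data and names are illustrative.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iters=2000):
    """Batch gradient descent for h(x) = w0 + w1 * x with
    J(w) = (1 / 2m) * sum_i (h(x_i) - y_i)^2."""
    m = len(y)
    w0, w1 = 0.0, 0.0
    for _ in range(iters):
        err = (w0 + w1 * x) - y        # h(x_i) - y_i for every training point
        grad_w0 = np.sum(err) / m      # dJ/dw0
        grad_w1 = np.sum(err * x) / m  # dJ/dw1
        w0 -= alpha * grad_w0          # simultaneous update of both weights
        w1 -= alpha * grad_w1
    return w0, w1

# Toy usage: recover roughly y = 2 + 3x from noisy samples.
x = np.linspace(0, 1, 50)
y = 2 + 3 * x + 0.05 * np.random.randn(50)
w0, w1 = gradient_descent(x, y)
```

Too large a learning rate makes the iterates diverge, and too small a rate makes progress very slow, which is the effect illustrated on slide 24.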

  24. Gradient descent (effect of learning rate).

  25. Gradient descent (landscape of the cost function). [figure: two surface plots of the cost over the parameter plane]

  26. Challenges with gradient descent
      1. Local minima: a local minimum is a minimum within some neighborhood that need not be (but may be) a global minimum.
      2. Saddle points: for non-convex functions, having a zero gradient is not good enough. Example: f(x) = x_1^2 − x_2^2 has zero gradient at x = (0, 0), but this point is clearly not a local minimum, since x = (0, ε) has a smaller function value. The point (0, 0) is called a saddle point of this function (see the check below).
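A quick numerical check of the saddle-point example on slide 26, f(x) = x_1^2 − x_2^2: the gradient vanishes at the origin, yet a nearby point along x_2 already has a smaller function value. Written as a short sketch with illustrative names.

```python
import numpy as np

def f(x):
    return x[0] ** 2 - x[1] ** 2

def grad_f(x):
    return np.array([2 * x[0], -2 * x[1]])

origin = np.array([0.0, 0.0])
eps = 1e-3
print(grad_f(origin))                      # [0. 0.]  -> zero gradient at the origin
print(f(origin), f(np.array([0.0, eps])))  # 0.0 -1e-06 -> (0, eps) is lower, so (0, 0) is a saddle, not a minimum
```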

  27. Challenges with gradient descent.
