

  1. Presentation about Deep Learning --- Zhongwu Xie

  2. Contents: 1. Brief introduction of Deep Learning. 2. Brief introduction of Backpropagation. 3. Brief introduction of Convolutional Neural Networks.

  3. Deep learning

  4. I. Introduction to Deep Learning: "Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones." --- Ian Goodfellow

  5. I. Introduction to Deep Learning: The Venn diagram on the left shows how deep learning is a kind of representation learning, which is in turn a kind of machine learning. The plot on the right shows that a deep learning model is built from multiple layers.

  6. I. What is Deep Learning?
      • Data: {(x_j, y_j)}, 1 ≤ j ≤ n
      • Model: ANN
      • Criterion:
        - Cost function: L(y, f(x))
        - Empirical risk minimization: R(θ) = (1/n) Σ_{j=1}^{n} L(y_j, f(x_j; θ))
        - Regularization: ‖w‖₁, ‖w‖₂ (weight-norm penalties), early stopping, dropout
        - Objective function: min_θ R(θ) + λ · (regularization function)
      • Algorithm: backpropagation + gradient descent
      Learning is cast as optimization.
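
      A minimal sketch of "learning cast as optimization": a plain linear model stands in for the ANN, the empirical risk is a mean squared-error cost with an L2 penalty, and gradient descent does the minimization. All names and values below are illustrative, not from the slides.

```python
import numpy as np

def empirical_risk(w, b, X, y, lam):
    """Mean squared-error cost plus an L2 regularization term."""
    pred = X @ w + b
    return np.mean((pred - y) ** 2) + lam * np.sum(w ** 2)

def fit_linear_model(X, y, lam=1e-3, lr=0.1, steps=500):
    """Minimize the regularized empirical risk by plain gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        err = X @ w + b - y                       # residuals on the training set
        grad_w = 2 * X.T @ err / n + 2 * lam * w  # d(risk)/dw
        grad_b = 2 * np.mean(err)                 # d(risk)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage: recover y = 2*x1 - 3*x2 + 1 from noisy samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + 1 + 0.01 * rng.normal(size=200)
w, b = fit_linear_model(X, y)
print(w, b)  # close to [2, -3] and 1
```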

  7. II. Why do we need to learn Deep Learning? --- Efficiency
      Famous instances: self-driving cars, AlphaGo
      • Speech Recognition --- phoneme error rate on TIMIT: HMM-GMM systems in the 1990s: about 26%; Restricted Boltzmann Machines (RBMs) in 2009: 20.7%; LSTM-RNN in 2013: 17.7%
      • Computer Vision --- the top-5 error of the ILSVRC 2017 classification task is 2.251%, while human-level error is about 5.1%.
      • Natural Language Processing --- language models (n-gram), machine translation
      • Recommender Systems --- recommend ads, social-network news feeds, movies, jokes, or advice from experts, etc.

  8. Backward propagation

  9. I. Introduction to Notation
      z = wᵀx + b,   a = ŷ = σ(wᵀx + b) = σ(z)
      The layers are numbered layer 0 (input), layer 1, layer 2, …
      w^[l]_{jk} is the weight from the j-th neuron in the (l−1)-th layer to the k-th neuron in the l-th layer (for example, w^[2]_{43} in the figure).

  10. I. Introduction to Forward Propagation and Notation
      For the 3-4-1 network in the figure (inputs x1, x2, x3, four hidden units, one output):
      z^[1]_j = Σ_{k=1}^{3} w^[1]_{kj} x_k + b^[1]_j,   a^[1]_j = σ(z^[1]_j),   j = 1, …, 4,   and the network output is ŷ = a.
      In vector form: z^[1] = W^[1] x + b^[1],  a^[1] = σ(z^[1]), where σ is the sigmoid function applied element-wise.
      Cost function: L(a, y); the shorthand dW^[1] = ∂L(a,y)/∂W^[1] and db^[1] = ∂L(a,y)/∂b^[1] is used for the gradients.
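
      A minimal NumPy sketch of this forward pass for the pictured 3-4-1 network; the weights below are random placeholders, not values from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: z[1] = W[1] x + b[1], a[1] = sigma(z[1]);
#               z[2] = W[2] a[1] + b[2], y_hat = a[2].
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))   # layer 1: 3 inputs -> 4 units
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))   # layer 2: 4 units -> 1 output

x = np.array([[0.5], [-1.2], [0.3]])                 # one input column vector (x1, x2, x3)

z1 = W1 @ x + b1          # pre-activations of layer 1 (shape 4x1)
a1 = sigmoid(z1)          # activations of layer 1
z2 = W2 @ a1 + b2         # pre-activation of the output neuron
y_hat = sigmoid(z2)       # network output a[2] = y_hat
print(y_hat)
```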

  11. II. Backward propagation --- the chain rule
      If x = f(w), y = f(x), z = f(y), then ∂z/∂w = (∂z/∂y)(∂y/∂x)(∂x/∂w).
      The functions of a neural network compose in the same way, so we can use the chain rule to compute the gradients of the network:
      x, w, b  →  z = wᵀx + b  →  a = σ(z)  →  L(a, y)
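
      A quick numerical check of the chain rule on the scalar composition above, taking f to be the sigmoid; that choice of f is an assumption made only for the example.

```python
import numpy as np

def f(t):
    return 1.0 / (1.0 + np.exp(-t))   # sigmoid, standing in for f

def df(t):
    return f(t) * (1.0 - f(t))        # its derivative

w = 0.7
x, y = f(w), f(f(w))
analytic = df(y) * df(x) * df(w)      # dz/dw by the chain rule

eps = 1e-6                            # finite-difference estimate of dz/dw
numeric = (f(f(f(w + eps))) - f(f(f(w - eps)))) / (2 * eps)
print(analytic, numeric)              # the two values agree
```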

  12. II. Backward propagation --- the chain rule
      Cost: L(a, y) = −[y log a + (1 − y) log(1 − a)]
      Forward: z^[1] = W^[1] x + b^[1],  a^[1] = σ(z^[1]),  z^[2] = W^[2] a^[1] + b^[2],  a^[2] = σ(z^[2]),  L(a^[2], y)
      Backward, applying the chain rule step by step:
      da^[2] = ∂L(a,y)/∂a^[2] = −y/a^[2] + (1 − y)/(1 − a^[2])
      dz^[2] = ∂L(a,y)/∂z^[2] = (∂L/∂a^[2])(∂a^[2]/∂z^[2]) = a^[2] − y
      dW^[2] = ∂L(a,y)/∂W^[2] = dz^[2] (a^[1])ᵀ
      db^[2] = ∂L(a,y)/∂b^[2] = dz^[2]
      dz^[1] = ∂L(a,y)/∂z^[1] = (W^[2])ᵀ dz^[2] ∗ σ′(z^[1])
      dW^[1] = ∂L(a,y)/∂W^[1] = dz^[1] xᵀ
      db^[1] = ∂L(a,y)/∂b^[1] = dz^[1]
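
      The same gradient expressions written out in NumPy for one training example (x, y); the random initialization and the 3-4-1 sizes are assumptions for illustration, not the presenter's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
x, y = rng.normal(size=(3, 1)), 1.0
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

# Forward pass
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# Backward pass: the chain-rule expressions derived above
dz2 = a2 - y                         # dL/dz[2] = a[2] - y
dW2 = dz2 @ a1.T                     # dL/dW[2] = dz[2] a[1]^T
db2 = dz2                            # dL/db[2] = dz[2]
dz1 = (W2.T @ dz2) * a1 * (1 - a1)   # dL/dz[1] = W[2]^T dz[2] * sigma'(z[1])
dW1 = dz1 @ x.T                      # dL/dW[1] = dz[1] x^T
db1 = dz1                            # dL/db[1] = dz[1]
print(dW1, db1)
```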

  13. II. Summary: The Backpropagation
      A small change Δw^[l]_{jk} in one weight perturbs the cost by
      ΔC ≈ (∂C/∂w^[l]_{jk}) Δw^[l]_{jk},
      where the partial derivative expands along every path from that weight to the output:
      ∂C/∂w^[l]_{jk} = Σ_{mnp…q} (∂C/∂a^[L]_m)(∂a^[L]_m/∂a^[L−1]_n)(∂a^[L−1]_n/∂a^[L−2]_p) ⋯ (∂a^[l+1]_q/∂a^[l]_j)(∂a^[l]_j/∂w^[l]_{jk})
      "The backpropagation algorithm is a clever way of keeping track of small perturbations to the weights (and biases) as they propagate through the network, reach the output, and then affect the cost." --- Michael Nielsen
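
      A small check of this perturbation view on a single logistic neuron, a toy stand-in rather than the presenter's network: nudging the weights by Δw changes the cost by roughly (∂C/∂w) · Δw.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cost(w, b, x, y):
    """Cross-entropy cost of one logistic neuron, as on slide 12."""
    a = sigmoid(w @ x + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

w, b = np.array([0.4, -0.6]), 0.1
x, y = np.array([1.0, 2.0]), 1.0

a = sigmoid(w @ x + b)
grad_w = (a - y) * x                 # dC/dw = (a - y) x, from the previous slide

dw = 1e-3 * np.array([1.0, -2.0])    # a small perturbation of the weights
predicted = grad_w @ dw              # sum_k (dC/dw_k) * dw_k
actual = cost(w + dw, b, x, y) - cost(w, b, x, y)
print(predicted, actual)             # nearly equal for small dw
```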

  14. II. Summary: The Backpropagation algorithm
      1. Input x: set the corresponding activation a^[0] for the input layer.
      2. Feedforward: for each l = 1, 2, …, L compute z^[l] = W^[l] a^[l−1] + b^[l] and a^[l] = σ(z^[l]).
      3. Output error dz^[L]: dz^[L] = a^[L] − y.
      4. Backpropagate the error: for each l = L−1, L−2, …, 1 compute dz^[l] = (W^[l+1])ᵀ dz^[l+1] ∗ σ′(z^[l]).
      5. Output: the gradient of the cost function is given by dW^[l] = ∂L(a,y)/∂W^[l] = dz^[l] (a^[l−1])ᵀ and db^[l] = ∂L(a,y)/∂b^[l] = dz^[l].
      Update w^[l]_{jk} and b^[l]_j:
      w^[l]_{jk} := w^[l]_{jk} − α ∂L(a,y)/∂w^[l]_{jk}
      b^[l]_j := b^[l]_j − α ∂L(a,y)/∂b^[l]_j
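
      A compact sketch of steps 1-5 plus the update rule, assuming all-sigmoid layers and the cross-entropy output error from step 3; train_step and its arguments are illustrative names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(W, b, x, y, lr=0.5):
    """One pass of steps 1-5 above plus the gradient-descent update.

    W, b are lists with one weight matrix / bias column vector per layer;
    x is the input column vector a[0]; y has the shape of the output layer.
    """
    L = len(W)
    # Steps 1-2: input and feedforward, caching every activation a[l].
    a = [x]
    for l in range(L):
        a.append(sigmoid(W[l] @ a[-1] + b[l]))
    # Step 3: output error for a sigmoid + cross-entropy output layer.
    dz = a[-1] - y
    # Steps 4-5: backpropagate, form the gradients, and update layer by layer.
    for l in reversed(range(L)):
        dW = dz @ a[l].T          # dW = dz (previous activation)^T
        db = dz                   # db = dz
        if l > 0:
            # Step 4: pass the error back, using sigma'(z) = a (1 - a).
            dz = (W[l].T @ dz) * a[l] * (1 - a[l])
        W[l] -= lr * dW           # w := w - alpha * dL/dw
        b[l] -= lr * db           # b := b - alpha * dL/db
    return W, b

# Toy usage: a 3-4-1 network fitted to a single example.
rng = np.random.default_rng(0)
W = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
b = [np.zeros((4, 1)), np.zeros((1, 1))]
x, y = rng.normal(size=(3, 1)), np.array([[1.0]])
for _ in range(100):
    W, b = train_step(W, b, x, y)
```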

  15. Convolutional Neural Networks

  16. 1. Types of layers in a convolutional network:
      • Convolution
      • Pooling
      • Fully connected

  17. 2.1 Convolution in a Neural Network
      A 6 × 6 image whose left half is bright (10) and right half is dark (0), convolved with a 3 × 3 vertical-edge filter:

      10 10 10  0  0  0
      10 10 10  0  0  0                        0 30 30 0
      10 10 10  0  0  0       1  0 -1          0 30 30 0
      10 10 10  0  0  0   *   1  0 -1     =    0 30 30 0
      10 10 10  0  0  0       1  0 -1          0 30 30 0
      10 10 10  0  0  0
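
      A short NumPy sketch that reproduces this example; conv2d_valid is an illustrative helper implementing the unpadded sliding-window sum that convolutional layers use.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 'valid' cross-correlation, the convolution used in CNN layers (no flipping)."""
    H, W = image.shape
    f, _ = kernel.shape
    out = np.zeros((H - f + 1, W - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

# The slide's example: bright left half, dark right half, vertical-edge filter.
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

print(conv2d_valid(image, kernel))
# Each row of the 4x4 output is [0, 30, 30, 0]: the filter responds only where
# the bright and dark regions meet, i.e. it detects the vertical edge.
```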

  18. 2.2 Multiple filters
      6 × 6 × 3  *  3 × 3 × 3  =  4 × 4 (one filter)
      6 × 6 × 3  *  two 3 × 3 × 3 filters  =  4 × 4 × 2 (the outputs stack along the depth dimension)
      Why convolutions?
      --- Parameter sharing
      --- Sparsity of connections
      (The shape and parameter arithmetic is sketched below.)
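
      Back-of-the-envelope arithmetic for the shapes above, plus a parameter-sharing comparison against a fully connected layer; the comparison itself is an added illustration, not from the slide.

```python
# A 6x6x3 input convolved with two 3x3x3 filters ("valid", stride 1) gives 4x4x2.
n, f, channels, num_filters = 6, 3, 3, 2
out_h = out_w = n - f + 1                  # (n - f)/s + 1 with stride s = 1
print(out_h, out_w, num_filters)           # 4 4 2

# Parameter sharing: the convolutional layer needs only
conv_params = num_filters * (f * f * channels + 1)         # weights + one bias per filter
# while a fully connected layer between the same two volumes would need
fc_params = (n * n * channels) * (out_h * out_w * num_filters)
print(conv_params, fc_params)              # 56 vs 3456
```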

  19. 3. Pooling layers
      • Max pooling. Hyperparameters: f = filter size, s = stride; max or average pooling.
      Example: max pooling with 2 × 2 filters and stride 2:

      1 3 2 1
      2 9 1 1          9 2
      1 3 2 3    →     6 3
      5 6 1 2

      • Pooling removes redundant information from the convolutional layer:
      --- With less spatial information you gain computation performance.
      --- Less spatial information also means fewer parameters, so less chance to overfit.
      --- You get some translation invariance.
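
      A minimal max-pooling sketch applied to the 2 × 2, stride-2 example above.

```python
import numpy as np

def max_pool(x, f=2, s=2):
    """Max pooling with filter size f and stride s (the slide's hyperparameters)."""
    H, W = x.shape
    out = np.zeros((1 + (H - f) // s, 1 + (W - f) // s))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.max(x[i * s:i * s + f, j * s:j * s + f])
    return out

# The 4x4 example from the slide, pooled with a 2x2 filter and stride 2.
x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)
print(max_pool(x))   # [[9. 2.], [6. 3.]]
```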

  20. 3. Fully connected layer
      The convolutional layers help extract certain features from the image; the fully connected layer is then able to generalize from these features into the output space. [LeCun et al., 1998. Gradient-based learning applied to document recognition.]

  21. 4. Classic networks --- AlexNet
      227 × 227 × 3 input
      → 11 × 11 conv, s = 4          → 55 × 55 × 96
      → 3 × 3 max-pool, s = 2        → 27 × 27 × 96
      → 5 × 5 conv, same             → 27 × 27 × 256
      → 3 × 3 max-pool, s = 2        → 13 × 13 × 256
      → 3 × 3 conv, same             → 13 × 13 × 384
      → 3 × 3 conv, same             → 13 × 13 × 384
      → 3 × 3 conv, same             → 13 × 13 × 256
      → 3 × 3 max-pool, s = 2        → 6 × 6 × 256
      → flatten to 9216 → FC 4096 → FC 4096 → softmax 1000
      Most of the parameters sit in the fully connected layers: 9216 × 4096 + 4096 × 4096 + 4096 × 1000 ≈ 58.6 million weights.
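
      A quick sketch of the shape arithmetic behind these numbers, using out = (n + 2p − f)/s + 1; the padding values for the "same" convolutions are the usual ones and are assumptions here.

```python
def conv_out(n, f, s=1, p=0):
    """Output side length of a convolution or pooling step."""
    return (n + 2 * p - f) // s + 1

n = 227                                    # 227x227x3 input
n = conv_out(n, f=11, s=4);      print(n)  # 55 -> 55x55x96
n = conv_out(n, f=3,  s=2);      print(n)  # 27 -> 27x27x96   (max pool)
n = conv_out(n, f=5,  s=1, p=2); print(n)  # 27 -> 27x27x256  ("same" conv)
n = conv_out(n, f=3,  s=2);      print(n)  # 13 -> 13x13x256  (max pool)
n = conv_out(n, f=3,  s=1, p=1); print(n)  # 13 -> 13x13x384  ("same" conv)
n = conv_out(n, f=3,  s=1, p=1); print(n)  # 13 -> 13x13x384
n = conv_out(n, f=3,  s=1, p=1); print(n)  # 13 -> 13x13x256
n = conv_out(n, f=3,  s=2);      print(n)  # 6  -> 6x6x256 = 9216 after flattening

# The fully connected head 9216 -> 4096 -> 4096 -> 1000 holds most of the weights:
fc_weights = 9216 * 4096 + 4096 * 4096 + 4096 * 1000
print(fc_weights)                          # 58,621,952 weights (biases excluded)
```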

  22. Thank you
