

SLIDE 1

Presentation about Deep Learning

  • -- Zhongwu Xie
SLIDE 2

Contents

1. Brief introduction to deep learning.
2. Brief introduction to backpropagation.
3. Brief introduction to convolutional neural networks.

SLIDE 3

Deep learning

SLIDE 4

I . Introduction to Deep Learning

Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones. -- Ian Goodfellow

SLIDE 5

I . Introduction to Deep Learning

The plot on the left is a Venn diagram showing that deep learning is a kind of representation learning, which is in turn a kind of machine learning. The plot on the right shows that a deep learning model is built from multiple layers.

SLIDE 6
I . What is Deep Learning

  • Data: $(x_j, y_j)$, $1 \le j \le n$
  • Model: ANN
  • Criterion:
  • Cost function: $L(y, f(x))$
  • Empirical risk minimization: $S(\theta) = \frac{1}{n}\sum_{j=1}^{n} L\big(y_j, f(x_j, \theta)\big)$
  • Regularization: $\|w\|_1$, $\|w\|_2$, Early Stopping, Dropout
  • Objective function: $\min_\theta\ S(\theta) + \lambda \cdot (\text{regularization function})$
  • Algorithm: backpropagation (BP) + gradient descent

Learning is cast as optimization.
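To make "learning is cast as optimization" concrete, here is a minimal NumPy sketch (illustrative only, not from the slides; the data and names are made up) that minimizes the regularized empirical risk of a single sigmoid neuron by gradient descent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def empirical_risk(w, b, X, y):
    """S(theta): mean cross-entropy L(y_j, f(x_j, theta)) over the n pairs."""
    a = sigmoid(X @ w + b)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                   # data x_j, 1 <= j <= n
y = (X[:, 0] + X[:, 1] > 0).astype(float)       # labels y_j
w, b = np.zeros(3), 0.0                         # parameters theta
alpha, lam = 0.5, 1e-3                          # step size, regularization weight

for _ in range(200):
    a = sigmoid(X @ w + b)
    dw = X.T @ (a - y) / len(y) + 2 * lam * w   # gradient of risk + L2 penalty
    db = np.mean(a - y)
    w, b = w - alpha * dw, b - alpha * db       # gradient-descent update

print(empirical_risk(w, b, X, y))               # the risk decreases as we optimize
```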

SLIDE 7

II . Why should we learn Deep Learning?

  • -- Efficiency
  • Speech Recognition
  • -- The phoneme error rate on TIMIT: HMM-GMM systems in the 1990s: about 26%; Restricted Boltzmann machines (RBMs) in 2009: 20.7%; LSTM-RNNs in 2013: 17.7%.
  • Computer Vision
  • -- The top-5 error of the ILSVRC 2017 classification task was 2.251%, while human error is about 5.1%.
  • Natural Language Processing
  • -- Language models (n-gram), machine translation.
  • Recommender Systems
  • -- Recommending ads, social-network news feeds, movies, jokes, or advice from experts, etc.

Famous instances: self-driving cars, AlphaGo.

SLIDE 8

Backward propagation

SLIDE 9

I . Introduction to Notation

(Figure: a small network with layer 0 (the inputs $x_1, x_2, x_3$), layer 1, and layer 2.)

$w^{[m]}_{kl}$ is the weight from the $k$th neuron in the $(m-1)$th layer to the $l$th neuron in the $m$th layer, e.g. $w^{[2]}_{43}$.

For a single neuron: $z = w^{T}x + b$, $a = g(z)$, and the prediction is $\hat{y} = a$.

SLIDE 10

I . Introduction to Forward propagation and Notation

For the network above (inputs $x_1, x_2, x_3$, four hidden neurons, prediction $\hat{y} = a$), layer 1 computes

$z^{[1]}_1 = w^{[1]\,T}_{1} x + b^{[1]}_1, \qquad a^{[1]}_1 = \sigma(z^{[1]}_1)$
$z^{[1]}_2 = w^{[1]\,T}_{2} x + b^{[1]}_2, \qquad a^{[1]}_2 = \sigma(z^{[1]}_2)$
$z^{[1]}_3 = w^{[1]\,T}_{3} x + b^{[1]}_3, \qquad a^{[1]}_3 = \sigma(z^{[1]}_3)$
$z^{[1]}_4 = w^{[1]\,T}_{4} x + b^{[1]}_4, \qquad a^{[1]}_4 = \sigma(z^{[1]}_4)$

or, stacked into vectors,

$$z^{[1]} =
\begin{bmatrix}
w^{[1]}_{11} & w^{[1]}_{12} & w^{[1]}_{13} & w^{[1]}_{14}\\
w^{[1]}_{21} & w^{[1]}_{22} & w^{[1]}_{23} & w^{[1]}_{24}\\
w^{[1]}_{31} & w^{[1]}_{32} & w^{[1]}_{33} & w^{[1]}_{34}
\end{bmatrix}^{T}
\begin{bmatrix} x_1\\ x_2\\ x_3 \end{bmatrix}
+
\begin{bmatrix} b^{[1]}_1\\ b^{[1]}_2\\ b^{[1]}_3\\ b^{[1]}_4 \end{bmatrix}
=
\begin{bmatrix}
\sum_{l=1}^{3} w^{[1]}_{l1} x_l + b^{[1]}_1\\
\sum_{l=1}^{3} w^{[1]}_{l2} x_l + b^{[1]}_2\\
\sum_{l=1}^{3} w^{[1]}_{l3} x_l + b^{[1]}_3\\
\sum_{l=1}^{3} w^{[1]}_{l4} x_l + b^{[1]}_4
\end{bmatrix}
=
\begin{bmatrix} z^{[1]}_1\\ z^{[1]}_2\\ z^{[1]}_3\\ z^{[1]}_4 \end{bmatrix}
= W^{[1]\,T} x + b^{[1]},$$

$$a^{[1]} = \begin{bmatrix} a^{[1]}_1\\ a^{[1]}_2\\ a^{[1]}_3\\ a^{[1]}_4 \end{bmatrix} = \sigma\!\left(z^{[1]}\right), \qquad \text{where } \sigma \text{ is the sigmoid function.}$$

Cost function: $L(a, y)$. The gradients we will need are
$$dW^{[1]} = \frac{\partial L(a, y)}{\partial W^{[1]}}, \qquad db^{[1]} = \frac{\partial L(a, y)}{\partial b^{[1]}}.$$

SLIDE 11

II . Backward propagation.

𝑦 π‘₯ 𝑐

  • --the chain rule

𝑨 = π‘₯π‘ˆπ‘¦ + 𝑐 𝑏 = 𝜏 (z) 𝑀 𝑏, 𝑧 If 𝑦 = 𝑔 π‘₯ , 𝑧 = 𝑔 𝑦 , 𝑨 = 𝑔(𝑧) So,

πœ–π‘¨ πœ–π‘₯ = πœ–π‘¨ πœ–π‘§ πœ–π‘§ πœ–π‘¦ πœ–π‘¦ πœ–π‘₯

  • --the functions of neural network are same as the above function , so we

can use the chain rule to the gradient of the neural network.
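As a quick illustration (not from the slides), the chain rule can be checked numerically for a chain of three sigmoids by comparing it with a finite-difference estimate:

```python
import numpy as np

def f(t):                      # sigmoid, used as the function in the chain
    return 1.0 / (1.0 + np.exp(-t))

def df(t):                     # its derivative, f'(t) = f(t)(1 - f(t))
    return f(t) * (1.0 - f(t))

w = 0.7
x = f(w)                       # x = f(w)
y = f(x)                       # y = f(x)
analytic = df(y) * df(x) * df(w)   # dz/dw = dz/dy * dy/dx * dx/dw, with z = f(y)

eps = 1e-6
numeric = (f(f(f(w + eps))) - f(f(f(w - eps)))) / (2 * eps)
print(analytic, numeric)       # the two values agree closely
```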

SLIDE 12

II . Backward propagation.

𝑦 π‘₯[1] 𝑐[1]

  • --the chain rule

𝑨[1] = π‘₯[1]𝑦 + 𝑐[1] 𝑏[1] = 𝜏(𝑨[1]) 𝑀 𝑏[2], 𝑧 𝑒𝑏[2] =

πœ–π‘€(𝑏,𝑧) πœ–π‘[2] =βˆ’ 𝑧 𝑏 + 1βˆ’π‘§ 1βˆ’π‘

𝑒𝑨[2] =

πœ–π‘€(𝑏,𝑧) πœ–π‘¨[2] = πœ–π‘€(𝑏,𝑧) πœ–π‘[2] Γ— πœ–π‘[2] πœ–π‘¨[2] = 𝑏[2] βˆ’ 𝑧

𝑒π‘₯[2] =

πœ–π‘€(𝑏,𝑧) πœ–π‘₯[2] = πœ–π‘€(𝑏,𝑧) 𝑏[2]

Γ—

πœ–π‘[2] πœ–π‘¨[2] Γ— πœ–π‘¨[2] πœ–π‘₯[2] = 𝑒𝑨[2]𝑏 1 π‘ˆ

𝑒𝑐[2] = πœ–π‘€(𝑏, 𝑧) πœ–π‘[2] = πœ–π‘€(𝑏, 𝑧) 𝑏[2] Γ— πœ–π‘[2] πœ–π‘¨[2] Γ— πœ–π‘¨[2] πœ–π‘[2] = 𝑒𝑨[2] 𝑒𝑨[1] = πœ–π‘€(𝑏, 𝑧) πœ–π‘¨[1] = πœ–π‘€(𝑏, 𝑧) 𝑏[2] Γ— πœ–π‘[2] πœ–π‘¨[2] Γ— πœ–π‘¨[2] πœ–π‘[1] Γ— πœ–π‘[1] πœ–π‘¨[1] = π‘₯ 2 π‘ˆπ‘’π‘¨[2]* πœβ€²(𝑨[1]) 𝑨[2] = π‘₯[2]𝑏[1] + 𝑐[2] 𝑏[2] = 𝜏(𝑨[2])

π‘₯[2] 𝑐[2]

𝑒π‘₯[1] =

πœ–π‘€(𝑏,𝑧) πœ–π‘₯[1] = πœ–π‘€(𝑏,𝑧) πœ–π‘[2] Γ— πœ–π‘[2] πœ–π‘¨[2] Γ— πœ–π‘¨[2] πœ–π‘[1] Γ— πœ–π‘[1] πœ–π‘¨[1] Γ— πœ–π‘¨[1] πœ–π‘₯[1]=𝑒𝑨[1]π‘¦π‘ˆ

𝑒𝑐[1] = πœ–π‘€(𝑏, 𝑧) πœ–π‘[1] = πœ–π‘€(𝑏, 𝑧) 𝑏[2] Γ— πœ–π‘[2] πœ–π‘¨[2] Γ— πœ–π‘¨[2] πœ–π‘[1] Γ— πœ–π‘[1] πœ–π‘¨[1] Γ— πœ–π‘¨[1] πœ–π‘[1] =𝑒𝑨[1] 𝑀 𝑏, 𝑧 = βˆ’[π‘§π‘šπ‘π‘•π‘ + 1 βˆ’ 𝑧 log 1 βˆ’ 𝑏 ]
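The same equations as a NumPy sketch for one training pair (x, y), following this slide's layout $z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}$; the values are random and purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x, y = rng.normal(size=(3, 1)), np.array([[1.0]])
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

# forward pass
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# backward pass (the equations above)
dz2 = a2 - y                        # dz[2] = a[2] - y
dW2 = dz2 @ a1.T                    # dW[2] = dz[2] a[1]^T
db2 = dz2                           # db[2] = dz[2]
dz1 = (W2.T @ dz2) * a1 * (1 - a1)  # dz[1] = W[2]^T dz[2] * sigma'(z[1])
dW1 = dz1 @ x.T                     # dW[1] = dz[1] x^T
db1 = dz1                           # db[1] = dz[1]
print(dW1.shape, dW2.shape)         # (4, 3) (1, 4), matching W1 and W2
```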

SLIDE 13

II . Summary : The Backpropagation

(Figure: perturbing one weight $w^{[m]}_{kl}$ changes the activation $a^{[m]}_{l}$, then the activations along every path to the output layer, and finally the cost by $\Delta C$.)

$$\frac{\partial C}{\partial w^{[m]}_{kl}} = \sum_{n, p, q, \ldots, r} \frac{\partial C}{\partial a^{[L]}_{n}}\, \frac{\partial a^{[L]}_{n}}{\partial a^{[L-1]}_{p}}\, \frac{\partial a^{[L-1]}_{p}}{\partial a^{[L-2]}_{q}} \cdots \frac{\partial a^{[m+1]}_{r}}{\partial a^{[m]}_{l}}\, \frac{\partial a^{[m]}_{l}}{\partial w^{[m]}_{kl}}$$

$$\Delta C \approx \sum_{n, p, q, \ldots, r} \frac{\partial C}{\partial a^{[L]}_{n}}\, \frac{\partial a^{[L]}_{n}}{\partial a^{[L-1]}_{p}}\, \frac{\partial a^{[L-1]}_{p}}{\partial a^{[L-2]}_{q}} \cdots \frac{\partial a^{[m+1]}_{r}}{\partial a^{[m]}_{l}}\, \frac{\partial a^{[m]}_{l}}{\partial w^{[m]}_{kl}}\, \Delta w^{[m]}_{kl}$$

where the sum runs over all paths of neurons $n, p, q, \ldots, r$ from the output layer $L$ back to layer $m + 1$.

The backpropagation algorithm is a clever way of keeping track of small perturbations to the weights (and biases) as they propagate through the network, reach the output, and then affect the cost.

  • --Michael Nielsen
SLIDE 14

II . Summary : The Backpropagation algorithm

1. Input $x$: set the activation $a^{[0]} = x$ for the input layer.
2. Feedforward: for each $l = 1, 2, \ldots, L$ compute $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$ and $a^{[l]} = \sigma(z^{[l]})$.
3. Output error $dz^{[L]}$: $dz^{[L]} = a^{[L]} - y$.
4. Backpropagate the error: for each $l = L-1, L-2, \ldots, 1$ compute $dz^{[l]} = (W^{[l+1]})^{T} dz^{[l+1]} * \sigma'(z^{[l]})$.
5. Output: the gradient of the cost function is given by $dW^{[l]} = \frac{\partial L(a, y)}{\partial W^{[l]}} = dz^{[l]} (a^{[l-1]})^{T}$ and $db^{[l]} = \frac{\partial L(a, y)}{\partial b^{[l]}} = dz^{[l]}$.

Update the weights and biases with learning rate $\alpha$:
$$W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}, \qquad b^{[l]} := b^{[l]} - \alpha\, db^{[l]}$$

SLIDE 15

Convolutional Neural Networks

SLIDE 16

1 . Types of layers in a convolutional network.

  • -Convolution
  • -Pooling
  • -Fully connected
SLIDE 17

2.1 Convolution in Neural Network

A 6Γ—6 image whose left half is 10 and right half is 0, convolved (*) with a 3Γ—3 vertical-edge filter, gives a 4Γ—4 output whose two middle columns are 30 and whose outer columns are 0: the filter responds exactly where the vertical edge is.

    10 10 10  0  0  0                          0 30 30  0
    10 10 10  0  0  0        1  0 -1           0 30 30  0
    10 10 10  0  0  0   *    1  0 -1     =     0 30 30  0
    10 10 10  0  0  0        1  0 -1           0 30 30  0
    10 10 10  0  0  0
    10 10 10  0  0  0
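The same example computed explicitly. This is a sketch using a plain "valid" sliding-window sum (a cross-correlation, as is conventional in CNN layers), not code from the presentation:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D cross-correlation: slide the kernel and sum the products."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)   # 6x6, bright left half
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)                 # vertical-edge filter
print(conv2d(image, kernel))   # 4x4 output; the two middle columns are 30
```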

SLIDE 18

2.2 Multiple filters

A 6Γ—6Γ—3 input convolved with one 3Γ—3Γ—3 filter gives a 4Γ—4 output; with two such filters the two outputs are stacked into a 4Γ—4Γ—2 volume:

    6 Γ— 6 Γ— 3  *  3 Γ— 3 Γ— 3 (two filters)  =  4 Γ— 4 Γ— 2

Why convolutions?

  • --Parameter sharing (the rough count after this list illustrates the saving)
  • --Sparsity of connections
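A rough, illustrative count of the saving for the 6Γ—6Γ—3 to 4Γ—4Γ—2 example above: two 3Γ—3Γ—3 filters versus a fully connected layer between the same two volumes.

```python
# Two 3x3x3 filters, each with one bias, versus connecting every input value
# (6*6*3 of them) to every output value (4*4*2 of them).
conv_params = 2 * (3 * 3 * 3 + 1)       # 56 parameters, shared across positions
fc_params = (6 * 6 * 3) * (4 * 4 * 2)   # 3456 weights, before any biases
print(conv_params, fc_params)
```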
SLIDE 19

3 . Pooling layers

  • Max pooling

Example: a 4Γ—4 input, max-pooled with a 2Γ—2 filter and stride 2:

    1 3 2 1
    2 9 1 1         9 2
    1 3 2 3   β†’     6 3
    5 6 1 2

Hyperparameters: f (filter size), s (stride), max or average pooling.

  • Removes redundant information from the convolutional layer.
  • --With less spatial information you gain computational performance.
  • --Less spatial information also means fewer parameters, so less chance to overfit.

  • --You get some translation invariance
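A small NumPy sketch of this 2Γ—2, stride-2 max pooling on the 4Γ—4 example above (illustrative, not from the slides):

```python
import numpy as np

def max_pool(x, f=2, s=2):
    """Max pooling with filter size f and stride s (the layer's hyperparameters)."""
    H, W = x.shape
    out = np.zeros((H // s, W // s))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)
print(max_pool(x))   # [[9. 2.] [6. 3.]]
```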
SLIDE 20

3 . Fully connected layer

The convolutional layers extract features from the image; the fully connected layer then generalizes from these features to the output space.

[LeCun et al., 1998. Gradient-based learning applied to document recognition.]

SLIDE 21

4 . Classic networks---AlexNet

227Γ—227Γ—3 β†’ CONV 11Γ—11, s = 4 β†’ 55Γ—55Γ—96
β†’ MAX-POOL 3Γ—3, s = 2 β†’ 27Γ—27Γ—96
β†’ CONV 5Γ—5, same β†’ 27Γ—27Γ—256
β†’ MAX-POOL 3Γ—3, s = 2 β†’ 13Γ—13Γ—256
β†’ CONV 3Γ—3, same β†’ 13Γ—13Γ—384
β†’ CONV 3Γ—3, same β†’ 13Γ—13Γ—384
β†’ CONV 3Γ—3, same β†’ 13Γ—13Γ—256
β†’ MAX-POOL 3Γ—3, s = 2 β†’ 6Γ—6Γ—256 = 9216
β†’ FC 4096 β†’ FC 4096 β†’ Softmax 1000

Parameters in the fully connected layers: 9216Γ—4096 + 4096Γ—4096 + 4096Γ—1000 β‰ˆ 58.6 million, the bulk of AlexNet's roughly 60 million parameters.
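For reference, the count above (weights only, ignoring the bias terms) can be checked in one line:

```python
fc = 9216 * 4096 + 4096 * 4096 + 4096 * 1000   # the three fully connected layers
print(fc)                                      # 58,621,952 weights
```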

SLIDE 22

Thank you