Presentation about Deep Learning
--- Zhongwu Xie
Contents
1. Brief introduction to deep learning.
2. Brief introduction to backpropagation.
3. Brief introduction to convolutional neural networks.

I. Introduction to Deep Learning
Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones. --- Ian Goodfellow
In the plot on the left, a Venn diagram shows that deep learning is a kind of representation learning, which is in turn a kind of machine learning. In the other plot, the graph shows that a deep network is built from multiple layers.
Learning is cast as optimization: choose the parameters $\theta$ to minimize the average loss over the training set,
$$\frac{1}{m}\sum_{i=1}^{m} L\big(y_i, f(x_i; \theta)\big), \qquad 1 \le i \le m.$$
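As a minimal sketch of this idea, the toy example below fits a one-parameter linear model by gradient descent on the average squared error; the data, model, and learning rate are illustrative choices, not taken from the presentation.

```python
import numpy as np

# Learning as optimization of the average loss (1/m) * sum_i L(y_i, f(x_i; theta))
# for a toy linear model f(x; theta) = theta * x with squared-error loss.

def f(x, theta):
    return theta * x

def loss(y, y_hat):
    return (y_hat - y) ** 2

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                      # targets generated by a "true" theta = 2

theta = 0.0
lr = 0.05
for step in range(100):
    y_hat = f(x, theta)
    cost = np.mean(loss(y, y_hat))        # (1/m) * sum of per-example losses
    grad = np.mean(2 * (y_hat - y) * x)   # d(cost)/d(theta)
    theta -= lr * grad                    # gradient descent step

print(theta)  # converges towards 2.0
```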
Error rates in speech recognition: HMM-GMM systems in the 1990s: about 26%; restricted Boltzmann machines (RBMs) in 2009: 20.7%; LSTM-RNN in 2013: 17.7%.
Famous instances: self-driving cars, AlphaGo.
[Figure: a small network with layer 0 (input), layer 1 (hidden), and layer 2 (output), taking inputs $x_1, x_2, x_3$.]

For a single neuron: $z = w^{\top} x + b$, $a = \sigma(z)$, and the prediction is $\hat{y} = a$.
For the hidden layer (layer 1), unit by unit:
$$z_1^{[1]} = w_1^{[1]\top} x + b_1^{[1]}, \quad a_1^{[1]} = \sigma(z_1^{[1]})$$
$$z_2^{[1]} = w_2^{[1]\top} x + b_2^{[1]}, \quad a_2^{[1]} = \sigma(z_2^{[1]})$$
$$z_3^{[1]} = w_3^{[1]\top} x + b_3^{[1]}, \quad a_3^{[1]} = \sigma(z_3^{[1]})$$
$$z_4^{[1]} = w_4^{[1]\top} x + b_4^{[1]}, \quad a_4^{[1]} = \sigma(z_4^{[1]})$$

In vectorized form:
$$z^{[1]} =
\begin{bmatrix}
w_{11}^{[1]} & w_{12}^{[1]} & w_{13}^{[1]} & w_{14}^{[1]} \\
w_{21}^{[1]} & w_{22}^{[1]} & w_{23}^{[1]} & w_{24}^{[1]} \\
w_{31}^{[1]} & w_{32}^{[1]} & w_{33}^{[1]} & w_{34}^{[1]}
\end{bmatrix}^{\top}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
+
\begin{bmatrix} b_1^{[1]} \\ b_2^{[1]} \\ b_3^{[1]} \\ b_4^{[1]} \end{bmatrix}
=
\begin{bmatrix}
\sum_{i=1}^{3} w_{i1}^{[1]} x_i + b_1^{[1]} \\
\sum_{i=1}^{3} w_{i2}^{[1]} x_i + b_2^{[1]} \\
\sum_{i=1}^{3} w_{i3}^{[1]} x_i + b_3^{[1]} \\
\sum_{i=1}^{3} w_{i4}^{[1]} x_i + b_4^{[1]}
\end{bmatrix}
=
\begin{bmatrix} z_1^{[1]} \\ z_2^{[1]} \\ z_3^{[1]} \\ z_4^{[1]} \end{bmatrix}
= W^{[1]\top} x + b^{[1]}$$

$$a^{[1]} =
\begin{bmatrix} a_1^{[1]} \\ a_2^{[1]} \\ a_3^{[1]} \\ a_4^{[1]} \end{bmatrix}
= \sigma(z^{[1]}), \quad \text{where } \sigma(x) \text{ is the sigmoid function.}$$
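A minimal NumPy sketch of this forward pass (3 inputs, 4 hidden units, sigmoid activation); the random weights are placeholders rather than values from the slides.

```python
import numpy as np

# Forward pass for one hidden layer: z[1] = W[1] x + b[1], a[1] = sigmoid(z[1]).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x  = rng.normal(size=(3, 1))    # input, shape (3, 1)
W1 = rng.normal(size=(4, 3))    # layer-1 weights, one row per hidden unit
b1 = np.zeros((4, 1))           # layer-1 biases

z1 = W1 @ x + b1                # pre-activations, shape (4, 1)
a1 = sigmoid(z1)                # activations, shape (4, 1)

print(z1.shape, a1.shape)       # (4, 1) (4, 1)
```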
Cost function: $L(a, y)$. Backpropagation computes
$$dW^{[1]} = \frac{\partial L(a, y)}{\partial W^{[1]}}, \qquad db^{[1]} = \frac{\partial L(a, y)}{\partial b^{[1]}}.$$
For a single neuron, $z = w^{\top} x + b$, $a = \sigma(z)$, with cost $L(a, y)$. If $x = f(w)$, $y = g(x)$, and $z = h(y)$, then
$$\frac{dz}{dw} = \frac{dz}{dy}\,\frac{dy}{dx}\,\frac{dx}{dw},$$
so we can use the chain rule to compute the gradients of the neural network.
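A quick numerical check of the chain rule, using an arbitrary composition of functions f, g, h chosen only for illustration, and comparing the result against a finite-difference estimate.

```python
import numpy as np

# Chain rule dz/dw = dz/dy * dy/dx * dx/dw for x = f(w), y = g(x), z = h(y).

f = np.sin;  df = np.cos
g = np.tanh; dg = lambda x: 1 - np.tanh(x) ** 2
h = np.exp;  dh = np.exp

w = 0.7
x = f(w); y = g(x); z = h(y)

chain = dh(y) * dg(x) * df(w)                  # dz/dw via the chain rule

eps = 1e-6                                     # finite-difference approximation
numeric = (h(g(f(w + eps))) - h(g(f(w - eps)))) / (2 * eps)

print(chain, numeric)                          # the two values agree closely
```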
For a two-layer network with input $x$ and parameters $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}$:
$$z^{[1]} = W^{[1]} x + b^{[1]}, \quad a^{[1]} = \sigma(z^{[1]}), \quad z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, \quad a^{[2]} = \sigma(z^{[2]}),$$
with cost $L(a, y) = -\,[\,y \log a + (1 - y)\log(1 - a)\,]$ evaluated at $a = a^{[2]}$.

The backward pass:
$$da^{[2]} = \frac{\partial L(a, y)}{\partial a^{[2]}} = -\frac{y}{a} + \frac{1 - y}{1 - a}$$
$$dz^{[2]} = \frac{\partial L(a, y)}{\partial z^{[2]}} = \frac{\partial L(a, y)}{\partial a^{[2]}} \times \frac{\partial a^{[2]}}{\partial z^{[2]}} = a^{[2]} - y$$
$$dW^{[2]} = \frac{\partial L(a, y)}{\partial W^{[2]}} = \frac{\partial L(a, y)}{\partial a^{[2]}} \times \frac{\partial a^{[2]}}{\partial z^{[2]}} \times \frac{\partial z^{[2]}}{\partial W^{[2]}} = dz^{[2]}\, a^{[1]\top}$$
$$db^{[2]} = \frac{\partial L(a, y)}{\partial b^{[2]}} = \frac{\partial L(a, y)}{\partial a^{[2]}} \times \frac{\partial a^{[2]}}{\partial z^{[2]}} \times \frac{\partial z^{[2]}}{\partial b^{[2]}} = dz^{[2]}$$
$$dz^{[1]} = \frac{\partial L(a, y)}{\partial z^{[1]}} = \frac{\partial L(a, y)}{\partial a^{[2]}} \times \frac{\partial a^{[2]}}{\partial z^{[2]}} \times \frac{\partial z^{[2]}}{\partial a^{[1]}} \times \frac{\partial a^{[1]}}{\partial z^{[1]}} = W^{[2]\top} dz^{[2]} * \sigma'(z^{[1]})$$
$$dW^{[1]} = \frac{\partial L(a, y)}{\partial W^{[1]}} = dz^{[1]}\, x^{\top}$$
$$db^{[1]} = \frac{\partial L(a, y)}{\partial b^{[1]}} = dz^{[1]}$$
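A NumPy sketch of this two-layer backward pass for a single example with binary cross-entropy loss; the shapes and initial values are assumptions made for illustration, and only the update formulas follow the slide.

```python
import numpy as np

# Two-layer forward and backward pass for one example (sigmoid activations).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x  = rng.normal(size=(3, 1))
y  = 1.0                                    # binary label
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

# forward pass
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
L = -(y * np.log(a2) + (1 - y) * np.log(1 - a2))

# backward pass, following dz[2] = a[2] - y, dW[2] = dz[2] a[1]^T, ...
dz2 = a2 - y
dW2 = dz2 @ a1.T
db2 = dz2
dz1 = (W2.T @ dz2) * (a1 * (1 - a1))        # sigma'(z1) = a1 * (1 - a1)
dW1 = dz1 @ x.T
db1 = dz1

print(dW1.shape, dW2.shape)                 # (4, 3) (1, 4)
```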
A change $\Delta w_{jk}^{l}$ in one weight changes the activation $a_j^{l}$, which changes every activation in the next layer, and so on up to the output layer and the cost $C$:
$$\frac{\partial C}{\partial w_{jk}^{l}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a_m^{L}}\, \frac{\partial a_m^{L}}{\partial a_n^{L-1}}\, \frac{\partial a_n^{L-1}}{\partial a_p^{L-2}} \cdots \frac{\partial a_q^{l+1}}{\partial a_j^{l}}\, \frac{\partial a_j^{l}}{\partial w_{jk}^{l}}$$
$$\Delta C \approx \sum_{mnp\ldots q} \frac{\partial C}{\partial a_m^{L}}\, \frac{\partial a_m^{L}}{\partial a_n^{L-1}}\, \frac{\partial a_n^{L-1}}{\partial a_p^{L-2}} \cdots \frac{\partial a_q^{l+1}}{\partial a_j^{l}}\, \frac{\partial a_j^{l}}{\partial w_{jk}^{l}}\, \Delta w_{jk}^{l}$$
The backpropagation algorithm is a clever way of keeping track of small perturbations to the weights (and biases) as they propagate through the network, reach the output, and then affect the cost.
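A small illustration of that perturbation view, assuming a single sigmoid neuron with squared-error cost (not a network from the slides): nudging one weight by a small delta changes the cost by roughly (dC/dw) * delta.

```python
import numpy as np

# Compare the actual change in the cost with the gradient times the perturbation.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, x, b, y):
    a = sigmoid(w * x + b)
    return 0.5 * (a - y) ** 2

x, b, y = 1.5, 0.1, 0.0
w = 0.8

a = sigmoid(w * x + b)
dC_dw = (a - y) * a * (1 - a) * x      # exact gradient via the chain rule

delta = 1e-3
dC = cost(w + delta, x, b, y) - cost(w, x, b, y)

print(dC, dC_dw * delta)               # the change in C tracks dC/dw * delta
```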
1. Input $x$: set the corresponding activation $a^{[0]} = x$ for the input layer.
2. Feedforward: for each $l = 1, 2, \ldots, L$ compute $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$ and $a^{[l]} = \sigma(z^{[l]})$.
3. Output error $dz^{[L]}$: $dz^{[L]} = a^{[L]} - y$.
4. Backpropagate the error: for each $l = L-1, L-2, \ldots, 1$ compute $dz^{[l]} = (W^{[l+1]})^{\top} dz^{[l+1]} * \sigma'(z^{[l]})$.
5. Output: the gradient of the cost function is given by $dW^{[l]} = \frac{\partial L(a,y)}{\partial W^{[l]}} = dz^{[l]}\,(a^{[l-1]})^{\top}$ and $db^{[l]} = \frac{\partial L(a,y)}{\partial b^{[l]}} = dz^{[l]}$.
Update $w_{jk}^{[l]}$ and $b_j^{[l]}$:
$$w_{jk}^{[l]} := w_{jk}^{[l]} - \alpha\, \frac{\partial L(a,y)}{\partial w_{jk}^{[l]}}, \qquad b_j^{[l]} := b_j^{[l]} - \alpha\, \frac{\partial L(a,y)}{\partial b_j^{[l]}},$$
where $\alpha$ is the learning rate.
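A sketch of the whole recipe (feedforward, output error, backpropagation, gradient-descent update) for a small fully connected network and one training example; the layer sizes, sigmoid activations on every layer, and learning rate are assumptions made for illustration.

```python
import numpy as np

# Steps 1-5 plus the parameter update, for layers [0..L] with L = 3.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
sizes = [3, 4, 4, 1]                              # layer widths, layer 0 = input
W = [None] + [rng.normal(size=(sizes[l], sizes[l - 1])) * 0.5
              for l in range(1, len(sizes))]
b = [None] + [np.zeros((sizes[l], 1)) for l in range(1, len(sizes))]
L = len(sizes) - 1
alpha = 0.1                                       # learning rate

x = rng.normal(size=(3, 1))
y = 1.0

# 1. input / 2. feedforward
a = [x] + [None] * L
z = [None] * (L + 1)
for l in range(1, L + 1):
    z[l] = W[l] @ a[l - 1] + b[l]
    a[l] = sigmoid(z[l])

# 3. output error
dz = [None] * (L + 1)
dz[L] = a[L] - y

# 4. backpropagate the error
for l in range(L - 1, 0, -1):
    dz[l] = (W[l + 1].T @ dz[l + 1]) * (a[l] * (1 - a[l]))

# 5. gradients and gradient-descent update
for l in range(1, L + 1):
    dW = dz[l] @ a[l - 1].T
    db = dz[l]
    W[l] -= alpha * dW
    b[l] -= alpha * db

print(a[L])                                        # current prediction
```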
[Figure: vertical edge detection. A 6×6 image whose left half has value 10 and whose right half has value 0 is convolved with a 3×3 vertical-edge filter (columns 1, 0, −1), producing a 4×4 output whose middle columns equal 30, i.e. the detected edge.]
Convolution over volumes: a $6 \times 6 \times 3$ input convolved with a $3 \times 3 \times 3$ filter gives a $4 \times 4$ output; with two such filters, the outputs stack into $4 \times 4 \times 2$.
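A sketch of these convolution examples, assuming a plain "valid" cross-correlation with stride 1 and the standard vertical-edge setup (left half of the image bright, right half dark); the exact pixel values are assumptions rather than values copied from the slides.

```python
import numpy as np

# conv2d: "valid" cross-correlation, no padding, stride 1.
def conv2d(image, kernel):
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Vertical edge detection: 6x6 image, left half 10s, right half 0s.
image = np.zeros((6, 6))
image[:, :3] = 10
v_edge = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])
print(conv2d(image, v_edge))       # 4x4 map with 30s along the edge

# Convolution over volumes: 6x6x3 input, two 3x3x3 filters -> 4x4x2 output.
volume = np.random.default_rng(3).normal(size=(6, 6, 3))
filters = np.random.default_rng(4).normal(size=(2, 3, 3, 3))
maps = np.stack([sum(conv2d(volume[:, :, c], f[:, :, c]) for c in range(3))
                 for f in filters], axis=-1)
print(maps.shape)                  # (4, 4, 2)
```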
Why convolutions?
Pooling hyperparameters: $f$ (filter size), $s$ (stride), max or average pooling.

Max pooling with $2 \times 2$ filters and stride 2:
$$\begin{bmatrix} 1 & 3 & 2 & 1 \\ 2 & 9 & 1 & 1 \\ 1 & 3 & 2 & 3 \\ 5 & 6 & 1 & 2 \end{bmatrix} \;\longrightarrow\; \begin{bmatrix} 9 & 2 \\ 6 & 3 \end{bmatrix}$$
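A small sketch that reproduces the max-pooling example above (2×2 filter, stride 2).

```python
import numpy as np

# Max pooling with filter size f and stride s.
def max_pool(x, f=2, s=2):
    oh, ow = (x.shape[0] - f) // s + 1, (x.shape[1] - f) // s + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.max(x[i * s:i * s + f, j * s:j * s + f])
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]])
print(max_pool(x))          # [[9. 2.]
                            #  [6. 3.]]
```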
The convolutional layers extract features from the image; the fully connected layers then generalize from these features to the output space.
[LeCun et al., 1998. Gradient-based learning applied to document recognition.]
AlexNet:
$227 \times 227 \times 3$ input
→ CONV $11 \times 11$, $s = 4$ → $55 \times 55 \times 96$
→ MAX-POOL $3 \times 3$, $s = 2$ → $27 \times 27 \times 96$
→ CONV $5 \times 5$, same → $27 \times 27 \times 256$
→ MAX-POOL $3 \times 3$, $s = 2$ → $13 \times 13 \times 256$
→ CONV $3 \times 3$, same → $13 \times 13 \times 384$
→ CONV $3 \times 3$, same → $13 \times 13 \times 384$
→ CONV $3 \times 3$, same → $13 \times 13 \times 256$
→ MAX-POOL $3 \times 3$, $s = 2$ → $6 \times 6 \times 256 = 9216$
→ FC 4096 → FC 4096 → Softmax 1000
Parameters in the fully connected layers: $9216 \times 4096 + 4096 \times 4096 + 4096 \times 1000 \approx 58.6$ million.
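A quick check of the fully connected parameter counts implied by the layer sizes above (weights only, biases ignored).

```python
# Weight counts for the fully connected part: 9216 -> 4096 -> 4096 -> 1000.
layers = [9216, 4096, 4096, 1000]           # 6*6*256 flattened, FC, FC, softmax
weights = [a * b for a, b in zip(layers, layers[1:])]
print(weights)                               # [37748736, 16777216, 4096000]
print(sum(weights))                          # 58621952, i.e. about 58.6 million
```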