Applied Math for Deep Learning
- Prof. Kuan-Ting Lai
2020/3/10
Outline
− Linear Algebra
− Probability
− Calculus
− Optimization

Linear Algebra
− Scalar: a real number
− Vector (1D): has a magnitude & a direction
− Matrix (2D): an array of numbers arranged in rows & columns
− Tensor: multi-dimensional arrays of numbers
− A matrix has m rows and n columns:
Santanu Pattanayak, "Pro Deep Learning with TensorFlow," Apress, 2017
− Multiplying an m×n matrix by an n×p matrix gives an m×p matrix; the column count of the first must equal the row count of the second
https://www.mathsisfun.com/algebra/matrix-multiplying.html
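A minimal NumPy sketch of the shape rule above (the matrices are made-up examples):

import numpy as np

# (2 x 3) times (3 x 2) gives a (2 x 2) result
A = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])           # shape (3, 2)
C = A @ B                          # C[i, j] = sum over k of A[i, k] * B[k, j]
print(C)                           # [[ 58  64], [139 154]]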
https://en.wikipedia.org/wiki/Transpose
− A vector is linearly dependent if it can be expressed as the linear combination of other vectors
− Vectors v₁, v₂, ⋯, vₙ are linearly independent if a₁v₁ + a₂v₂ + ⋯ + aₙvₙ = 0 implies all aᵢ = 0, ∀i ∈ {1, 2, ⋯, n}
− n linearly independent vectors span an n-dimensional space
Rank
− The number of linearly independent row or column vectors
− The dimension of the vector space generated by its columns
https://en.wikipedia.org/wiki/Rank_(linear_algebra)
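A quick check of the rank definition in NumPy (a made-up matrix whose second row is twice the first, so only two rows are independent):

import numpy as np

A = np.array([[1, 2, 3],
              [2, 4, 6],    # 2x the first row: linearly dependent
              [1, 0, 1]])
print(np.linalg.matrix_rank(A))   # 2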
Row-echelon form
− Multiplying a matrix by its inverse produces the identity matrix I
Solving an over-determined system with the pseudo-inverse:
− Create a square matrix AᵀA
− Multiply both sides by the inverse matrix (AᵀA)⁻¹
− (AᵀA)⁻¹Aᵀ is the pseudo-inverse

Ax = b, A ∈ ℝ^(m×n), b ∈ ℝ^m
AᵀAx = Aᵀb
x = (AᵀA)⁻¹Aᵀb
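A sketch of the normal-equation solution in NumPy (the data is made up); np.linalg.pinv computes the same pseudo-inverse more robustly:

import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])         # 3 equations, 2 unknowns
b = np.array([1.0, 2.0, 2.0])

x = np.linalg.inv(A.T @ A) @ A.T @ b   # x = (A^T A)^(-1) A^T b
x_pinv = np.linalg.pinv(A) @ b         # same least-squares solution
print(x, x_pinv)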
https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization-How-does-it-solve-the-problem-of-overfitting-Which-regularizer-to-use-and-when
− Equivalently, minimize the loss subject to a constraint on the norm of the weights
Eigenvectors & Eigenvalues
− An eigenvector's direction is unchanged when the linear transform A is applied to it:
Ax = λx, A ∈ ℝ^(n×n), x ∈ ℝ^n
− Applications:
− Principal Component Analysis (PCA)
− Eigenvector centrality
− PageRank
− …
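A small NumPy sketch of Ax = λx (the matrix is a made-up example):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eig(A)    # eigenvectors are the columns of eigvecs

v, lam = eigvecs[:, 0], eigvals[0]
print(np.allclose(A @ v, lam * v))     # True: A v = lambda v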
NumPy is the fundamental package for scientific computing with Python. It contains among other things:
− a powerful N-dimensional array object
− sophisticated (broadcasting) functions
− tools for integrating C/C++ and Fortran code
− useful linear algebra, Fourier transform, and random number capabilities
CS231n: Convolutional Neural Networks for Visual Recognition
− http://cs231n.stanford.edu/
− Scalars (0D tensors)
− Vectors (1D tensors)
− Matrices (2D tensors)
− Number of axes: x.ndim
− Shape: a tuple of integers showing how many elements the tensor has along each axis
− Data type: uint8, float32, or float64
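These three attributes in NumPy (a made-up 3D tensor):

import numpy as np

x = np.zeros((3, 4, 5), dtype=np.float32)
print(x.ndim)    # 3 -- number of axes
print(x.shape)   # (3, 4, 5) -- size along each axis
print(x.dtype)   # float32 -- data type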
− A sample is a single row of data
− Number of samples used for one iteration of gradient descent
− Batch size = 1: stochastic gradient descent
− 1 < batch size < all: mini-batch gradient descent
− Batch size = all: batch gradient descent
− Number of times that the learning algorithm works through all training samples
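How batches and epochs relate, as a sketch (array sizes and names are made up):

import numpy as np

X = np.random.rand(1000, 20)      # 1000 samples, 20 features
batch_size = 128
num_epochs = 3

for epoch in range(num_epochs):               # one epoch = one full pass over X
    for start in range(0, len(X), batch_size):
        batch = X[start:start + batch_size]   # one mini-batch per gradient step
        # ... compute gradients on `batch` and update the weights here ...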
− Reshaping a tensor means rearranging its rows and columns to match a target shape
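For instance, with a made-up (3, 2) array:

import numpy as np

x = np.array([[0., 1.],
              [2., 3.],
              [4., 5.]])           # shape (3, 2)
print(x.reshape(6, 1).shape)       # (6, 1) -- same 6 elements, new layout
print(x.reshape(2, 3))             # [[0. 1. 2.], [3. 4. 5.]]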
− Projects data into a lower-dimensional space
− Dimension reduction
− If df(x)/dx = 0, it is a minima or a maxima point; then we study the second derivative:
− If d²f(x)/dx² < 0 => Maxima
− If d²f(x)/dx² > 0 => Minima
− If d²f(x)/dx² = 0 => Point of inflection
− Setting the gradient ∇f = 0 gives us the list of stationary points.
− If the Hessian matrix of the function at x₀ has both positive and negative eigenvalues, then x₀ is a saddle point
− If the eigenvalues of the Hessian matrix are all positive, then the stationary point is a local minima
− If the eigenvalues are all negative, then the stationary point is a local maxima
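Classifying a stationary point from the Hessian's eigenvalues (the matrix is a made-up example):

import numpy as np

H = np.array([[2.0, 0.0],
              [0.0, -3.0]])        # Hessian at a stationary point
eigvals = np.linalg.eigvalsh(H)    # eigenvalues of a symmetric matrix

if np.all(eigvals > 0):
    print("local minimum")
elif np.all(eigvals < 0):
    print("local maximum")
else:
    print("saddle point or degenerate")   # here: eigenvalues 2 and -3 -> saddle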
Training loop:
− Draw a batch of training samples x and corresponding targets y
− Run the network on x to obtain predictions y_pred (a forward pass)
− Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y
− Compute the gradient of the loss with regard to the network's parameters (a backward pass)
− Move the parameters a little in the opposite direction from the gradient: W -= step * gradient
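One such update step for a linear model, as a sketch (all names and sizes are illustrative):

import numpy as np

W = np.zeros(20)                      # parameters
step = 0.01                           # learning rate

x = np.random.rand(128, 20)           # batch of samples
y = np.random.rand(128)               # corresponding targets
y_pred = x @ W                        # forward pass
loss = np.mean((y_pred - y) ** 2)     # mismatch between y_pred and y
gradient = 2 * x.T @ (y_pred - y) / len(x)   # dLoss/dW (backward pass)
W -= step * gradient                  # move against the gradient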
Axioms of probability, for events Eᵢ:
− P(E) ∈ ℝ, 0 ≤ P(E) ≤ 1
− P(S) = 1, where S is the sample space
− Additivity: for any countable sequence of mutually exclusive events Eᵢ,
P(E₁ ∪ E₂ ∪ ⋯ ∪ Eₙ) = P(E₁) + P(E₂) + ⋯ + P(Eₙ) = Σᵢ₌₁ⁿ P(Eᵢ)
− P(A|B) = P(AB) / P(B)
− P(AB) = P(A|B) P(B) = P(B|A) P(A)
− Two events A and B are independent if the probability of their intersection is equal to the product of their individual probabilities
− P(AB) = P(A) P(B)
− P(A|B) = P(A)
Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
− Remember P(A|B) = P(AB) / P(B)
− So P(AB) = P(A|B) P(B) = P(B|A) P(A)
− Then Bayes: P(A|B) = P(B|A) P(A) / P(B)
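A worked Bayes example with made-up numbers (a rare event A and an imperfect test B):

p_A = 0.01                         # prior P(A)
p_B_given_A = 0.9                  # likelihood P(B|A)
p_B = 0.9 * 0.01 + 0.1 * 0.99      # total probability P(B)

p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)                 # ~0.083: the posterior is still small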
− Probability mass function (PMF): a function that gives the probability that a discrete random variable is exactly equal to some value
− Probability density function (PDF): specifies the probability of the random variable falling within a particular range
− The density integrates to one over its domain D: ∫_D P(x) dx = 1
− e.g. a fair die: P(X = i) = 1/6, i ∈ {1, 2, 3, 4, 5, 6}
Expectation
E[X] = x₁p₁ + x₂p₂ + ⋯ + xₙpₙ = Σᵢ₌₁ⁿ xᵢpᵢ
E[X] = ∫_D x P(x) dx
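Computing E[X] for the fair-die example above:

import numpy as np

x = np.arange(1, 7)        # outcomes 1..6
p = np.full(6, 1 / 6)      # P(X = i) = 1/6
print(np.sum(x * p))       # 3.5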
Variance
Var(X) = E[(X − μ)²], where μ = E[X]
Var(X) = ∫_D (x − μ)² P(x) dx
Covariance & correlation
Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)], where μ_X = E[X], μ_Y = E[Y]
ρ = Cov(X, Y) / (σ_X σ_Y)
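Sample estimates of these quantities in NumPy (the data is randomly generated, with Y built to correlate with X):

import numpy as np

X = np.random.randn(1000)
Y = 0.5 * X + 0.1 * np.random.randn(1000)

print(np.var(X))                # Var(X)
print(np.cov(X, Y)[0, 1])       # Cov(X, Y)
print(np.corrcoef(X, Y)[0, 1])  # rho, close to 1 here by construction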
− Central Limit Theorem: averages of samples of observations of random variables independently drawn from independent distributions converge in distribution to the normal distribution
https://en.wikipedia.org/wiki/Optimization_problem
Regularized linear regression:
f(x) = wᵀx + b
minimize Σᵢ (f(xᵢ) − yᵢ)² + λ‖w‖²
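A sketch of the L2-regularized (ridge) solution in closed form; the data and λ are made up:

import numpy as np

X = np.random.rand(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.01 * np.random.randn(100)

Xb = np.hstack([X, np.ones((100, 1))])   # fold the bias b in as a constant feature
lam = 0.1
I = np.eye(4)
I[-1, -1] = 0                            # don't penalize the bias term
w = np.linalg.solve(Xb.T @ Xb + lam * I, Xb.T @ y)
print(w)                                 # approximately [1.0, -2.0, 0.5, 0.3]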
References
− Santanu Pattanayak, "Pro Deep Learning with TensorFlow," Apress, 2017
− François Chollet, "Deep Learning with Python," Manning, 2017, Ch. 2: "The Mathematical Building Blocks of Neural Networks"
− https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization-How-does-it-solve-the-problem-of-overfitting-Which-regularizer-to-use-and-when