SLIDE 1

Applied Math for Deep Learning

  • Prof. Kuan-Ting Lai

2020/3/10

SLIDE 2

Applied Math for Deep Learning

  • Linear Algebra
  • Probability
  • Calculus
  • Optimization
SLIDE 3

Linear Algebra

  • Scalar

− real numbers

  • Vector (1D)

− Has a magnitude & a direction

  • Matrix (2D)

− An array of numbers arranged in rows & columns

  • Tensor (>=3D)

− Multi-dimensional arrays of numbers

SLIDE 4

Real-world examples of Data Tensors

  • Timeseries Data – 3D (samples, timesteps, features)
  • Images – 4D (samples, height, width, channels)
  • Video – 5D (samples, frames, height, width, channels)

SLIDE 5

The Matrix

SLIDE 6

Matrix

  • Define a matrix with $m$ rows and $n$ columns: $A \in \mathbb{R}^{m \times n}$

Santanu Pattanayak, ”Pro Deep Learning with TensorFlow,” Apress, 2017

SLIDE 7

Matrix Operations

  • Addition and Subtraction
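
A minimal NumPy sketch of these two element-wise operations (the matrices below are made up for illustration):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[6, 5, 4],
              [3, 2, 1]])

print(A + B)   # element-wise addition
print(A - B)   # element-wise subtraction
```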
SLIDE 8

Matrix Multiplication

  • Two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$
  • The number of columns of A must equal the number of rows of B, i.e. $n = p$
  • $A \cdot B = C$, where $C \in \mathbb{R}^{m \times q}$
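
A quick NumPy check of the shape rule above, with made-up matrices:

```python
import numpy as np

A = np.array([[1, 2],          # A is 2x2  (m=2, n=2)
              [3, 4]])
B = np.array([[5, 6, 7],       # B is 2x3  (p=2, q=3)
              [8, 9, 10]])

C = A @ B                      # valid because n == p
print(C.shape)                 # (2, 3), i.e. m x q
```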

SLIDE 9

Example of Matrix Multiplication (3-1)

https://www.mathsisfun.com/algebra/matrix-multiplying.html

SLIDE 10

Example of Matrix Multiplication (3-2)

https://www.mathsisfun.com/algebra/matrix-multiplying.html

SLIDE 11

Example of Matrix Multiplication (3-3)

https://www.mathsisfun.com/algebra/matrix-multiplying.html

SLIDE 12

Matrix Transpose

https://en.wikipedia.org/wiki/Transpose

SLIDE 13

Dot Product

  • The dot product of two vectors is a scalar
  • Notation: $v_1 \cdot v_2$ or $v_1^T v_2$
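
In NumPy (example vectors chosen for illustration):

```python
import numpy as np

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([4.0, 5.0, 6.0])

print(np.dot(v1, v2))   # 32.0
print(v1 @ v2)          # same result
```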
SLIDE 14

Dot Product of Matrix

SLIDE 15

Linear Independence

  • A vector is linearly dependent on other vectors if it can be expressed as a linear combination of those vectors
  • A set of vectors $v_1, v_2, \dots, v_n$ is linearly independent if $a_1 v_1 + a_2 v_2 + \dots + a_n v_n = 0$ implies $a_i = 0$ for all $i \in \{1, 2, \dots, n\}$

SLIDE 16

Span the Vector Space

  • $n$ linearly independent vectors can span an $n$-dimensional space

SLIDE 17

Rank of a Matrix

  • Rank is:

− The number of linearly independent row or column vectors
− The dimension of the vector space generated by its columns

  • Row rank = Column rank
  • Example:

https://en.wikipedia.org/wiki/Rank_(linear_algebra)

Row-echelon form
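
The Wikipedia example itself is not reproduced here; as a stand-in, a small matrix whose second row is a multiple of the first has rank 2:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [2, 4, 6],    # 2 x row 1, so it adds no new direction
              [0, 1, 1]])

print(np.linalg.matrix_rank(A))   # 2
```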

SLIDE 18

Identity Matrix I

  • Any vector or matrix multiplied by I remains unchanged
  • For a matrix $A_{m \times n}$: $A I_n = I_m A = A$
SLIDE 19

Inverse of a Matrix

  • The product of a square matrix $A$ and its inverse matrix $A^{-1}$ produces the identity matrix $I$
  • $A A^{-1} = A^{-1} A = I$
  • An inverse matrix is square, but not all square matrices have inverses
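
A small NumPy check with an invertible matrix chosen for illustration:

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

A_inv = np.linalg.inv(A)
print(A @ A_inv)          # approximately the 2x2 identity matrix
```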
SLIDE 20

Pseudo Inverse

  • A non-square matrix can have a left-inverse or a right-inverse matrix
  • Example: $Ax = b$, where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^{m}$

− Form the square matrix $A^T A$, so $A^T A x = A^T b$
− Multiply both sides by the inverse matrix $(A^T A)^{-1}$, giving $x = (A^T A)^{-1} A^T b$
− $(A^T A)^{-1} A^T$ is the pseudo-inverse of $A$
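
A hedged sketch of the same idea in NumPy (the over-determined system below is made up; `np.linalg.pinv` computes the pseudo-inverse directly):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])        # 3 equations, 2 unknowns
b = np.array([1.0, 2.0, 2.0])

x = np.linalg.pinv(A) @ b         # least-squares solution
# equivalent (when A has full column rank):
# x = np.linalg.inv(A.T @ A) @ A.T @ b
print(x)
```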

SLIDE 21

Norm

  • Norm is a measure of a vector’s magnitude
  • $\ell_2$ norm
  • $\ell_1$ norm
  • $\ell_p$ norm
  • $\ell_\infty$ norm
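
Computing these norms in NumPy for an example vector:

```python
import numpy as np

v = np.array([3.0, -4.0])

print(np.linalg.norm(v, 2))        # l2 norm: 5.0
print(np.linalg.norm(v, 1))        # l1 norm: 7.0
print(np.linalg.norm(v, np.inf))   # l-infinity norm: 4.0
```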
SLIDE 22

Unit norms in 2D Vectors

  • The set of all vectors of norm 1 in different 2D norms
SLIDE 23

L1 and L2 Regularization

https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization-How-does-it-solve-the-problem-of-overfitting-Which-regularizer-to-use-and-when


SLIDE 24

Eigen Vectors

  • An Eigenvector is a non-zero vector that is changed by only a scalar factor λ when a linear transformation $A$ is applied to it:
  • $x$ are Eigenvectors and $\lambda$ are Eigenvalues
  • One of the most important concepts for machine learning, ex:

− Principal Component Analysis (PCA)
− Eigenvector centrality
− PageRank
− …

$Ax = \lambda x,\quad A \in \mathbb{R}^{n \times n},\ x \in \mathbb{R}^{n}$
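
Computing eigenvalues and eigenvectors in NumPy (a simple diagonal matrix is used for illustration):

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # [2. 3.]
print(eigenvectors)   # columns are the corresponding eigenvectors
```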

SLIDE 25

Example: Shear Mapping

  • The horizontal axis is the Eigenvector

SLIDE 26

Power Iteration Method for Computing Eigenvector

1. Start with a random vector $v^{(0)}$
2. Calculate iteratively: $v^{(k+1)} = A v^{(k)}$
3. After $v^{(k)}$ converges, $v^{(k+1)} \cong v^{(k)}$
4. $v^{(k)}$ will be the Eigenvector with the largest Eigenvalue
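
A minimal sketch of power iteration; the normalization step is added here for numerical stability and does not change the direction the method converges to:

```python
import numpy as np

def power_iteration(A, num_iters=100):
    v = np.random.rand(A.shape[0])      # 1. start with a random vector
    for _ in range(num_iters):
        v = A @ v                       # 2. multiply by A
        v = v / np.linalg.norm(v)       #    normalize (added for stability)
    eigenvalue = v @ A @ v              # Rayleigh quotient of the converged vector
    return eigenvalue, v

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
print(power_iteration(A))               # dominant eigenvalue (~3.62) and its eigenvector
```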

SLIDE 27

NumPy for Linear Algebra

  • NumPy is the fundamental package for scientific computing with Python. It contains among other things:

− a powerful N-dimensional array object
− sophisticated (broadcasting) functions
− tools for integrating C/C++ and Fortran code
− useful linear algebra, Fourier transform, and random number capabilities

SLIDE 28

Python & NumPy tutorial

  • http://cs231n.github.io/python-numpy-tutorial/
  • Stanford CS231n: Convolutional Neural Networks

for Visual Recognition

− http://cs231n.stanford.edu/

SLIDE 29

Create Tensors

  • Scalars (0D tensors)
  • Vectors (1D tensors)
  • Matrices (2D tensors)
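
The slide's screenshots are not reproduced; a minimal equivalent in NumPy:

```python
import numpy as np

x0 = np.array(12)                  # scalar (0D tensor)
x1 = np.array([12, 3, 6, 14])      # vector (1D tensor)
x2 = np.array([[5, 78, 2],         # matrix (2D tensor)
               [6, 79, 3]])

print(x0.ndim, x1.ndim, x2.ndim)   # 0 1 2
```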

SLIDE 30

Create 3D Tensor
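
A minimal 3D example (the array contents are made up):

```python
import numpy as np

x3 = np.array([[[1, 2, 3], [4, 5, 6]],    # three 2x3 matrices stacked
               [[1, 2, 3], [4, 5, 6]],    # along a new first axis
               [[1, 2, 3], [4, 5, 6]]])

print(x3.ndim)    # 3
print(x3.shape)   # (3, 2, 3)
```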

SLIDE 31

Attributes of a Tensor

  • Number of axes (dimensions)

− x.ndim

  • Shape

− This is a tuple of integers showing how many elements the tensor has along each axis

  • Data type

− uint8, float32 or float64
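
Inspecting the three attributes on an example array (shaped like an MNIST-style image batch, chosen for illustration):

```python
import numpy as np

x = np.zeros((60000, 28, 28), dtype='float32')

print(x.ndim)    # 3 — number of axes
print(x.shape)   # (60000, 28, 28)
print(x.dtype)   # float32
```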

SLIDE 32

Manipulating Tensors in Numpy

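
The slide's screenshot is not reproduced; a small slicing sketch of the kind of manipulation it illustrates (the data is a random stand-in):

```python
import numpy as np

x = np.random.rand(60000, 28, 28)   # stand-in for an image dataset

batch = x[10:100]                   # select samples 10..99
print(batch.shape)                  # (90, 28, 28)

corner = x[:, :14, :14]             # top-left 14x14 pixels of every image
print(corner.shape)                 # (60000, 14, 14)
```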
SLIDE 33

Displaying the Fourth Digit

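
The slide shows a screenshot; a sketch of how the fourth digit is typically displayed, assuming the Keras MNIST dataset and matplotlib are available:

```python
from tensorflow.keras.datasets import mnist
import matplotlib.pyplot as plt

(train_images, train_labels), _ = mnist.load_data()

digit = train_images[4]               # the fourth digit (index 4)
plt.imshow(digit, cmap=plt.cm.binary)
plt.show()
```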
SLIDE 34

Real-world examples of Data Tensors

  • Vector data – 2D (samples, features)
  • Timeseries Data – 3D (samples, timesteps, features)
  • Images – 4D (samples, height, width, channels)
  • Video – 5D (samples, frames, height, width, channels)

SLIDE 35

Batch size & Epochs

  • A sample

− A sample is a single row of data

  • Batch size

− Number of samples used for one iteration of gradient descent
− Batch size = 1: stochastic gradient descent
− 1 < Batch size < all: mini-batch gradient descent
− Batch size = all: batch gradient descent

  • Epoch

− Number of times that the learning algorithm works through all training samples

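
A quick illustration of how the terms relate (the numbers are made up):

```python
import math

num_samples = 60000     # e.g. a training set the size of MNIST
batch_size = 128
epochs = 5

iterations_per_epoch = math.ceil(num_samples / batch_size)
print(iterations_per_epoch)            # 469 gradient updates per epoch
print(iterations_per_epoch * epochs)   # 2345 updates over 5 epochs
```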
SLIDE 36

Element-wise Operations for Matrix

  • Operate on each element
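
A short NumPy sketch of element-wise operations (not the slide's own code):

```python
import numpy as np

x = np.array([[1.0, -2.0], [3.0, -4.0]])
y = np.array([[5.0,  6.0], [7.0,  8.0]])

print(x + y)               # element-wise addition
print(x * y)               # element-wise (Hadamard) product
print(np.maximum(x, 0.0))  # element-wise relu
```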
SLIDE 37

NumPy Operation for Matrix

  • Leverage the Basic Linear Algebra Subprograms (BLAS)
  • BLAS is optimized using C or Fortran
SLIDE 38

Broadcasting

  • The smaller tensor is repeated along the extra axes of the larger tensor so that the shapes match
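
A small broadcasting example (shapes chosen for illustration):

```python
import numpy as np

X = np.random.rand(32, 10)   # e.g. a batch of 32 samples with 10 features
y = np.random.rand(10)       # a single vector of 10 values

Z = X + y                    # y is broadcast across the first axis of X
print(Z.shape)               # (32, 10)
```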

SLIDE 39

Tensor Dot

SLIDE 40

Implementation of Dot Product
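
The slide shows code from the book; a minimal sketch in the same spirit, a naive pure-Python dot product of two 1D vectors:

```python
import numpy as np

def naive_vector_dot(x, y):
    assert x.ndim == 1 and y.ndim == 1    # both inputs are vectors
    assert x.shape[0] == y.shape[0]       # of the same length
    z = 0.0
    for i in range(x.shape[0]):
        z += x[i] * y[i]
    return z

print(naive_vector_dot(np.array([1.0, 2.0]), np.array([3.0, 4.0])))   # 11.0
```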

SLIDE 41

Tensor Reshaping

  • Rearrange a tensor's rows and columns to match a target shape
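
Reshaping in NumPy (array contents are made up):

```python
import numpy as np

x = np.array([[0.0, 1.0],
              [2.0, 3.0],
              [4.0, 5.0]])       # shape (3, 2)

print(x.reshape((6, 1)).shape)   # (6, 1) — same 6 elements, new layout
print(x.reshape((2, 3)).shape)   # (2, 3)
```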

SLIDE 42

Matrix Transposition

  • Transposing a matrix means exchanging its rows and its columns
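
For example:

```python
import numpy as np

x = np.zeros((300, 20))
print(np.transpose(x).shape)   # (20, 300)
print(x.T.shape)               # (20, 300), shorthand for the same operation
```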
SLIDE 43

Unfolding the Manifold

  • Tensor operations are complex geometric transformations in high-dimensional space

− Dimension reduction

SLIDE 44
SLIDE 45

Differentiation


SLIDE 46

Gradient of a Function

  • Gradient is a multi-variable generalization of the derivative
  • Apply partial derivatives
  • Example
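
The slide's own example is not reproduced; as an illustration, for $f(x, y) = x^2 + y^2$:

$\nabla f(x, y) = \left( \dfrac{\partial f}{\partial x}, \dfrac{\partial f}{\partial y} \right) = (2x,\ 2y)$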
SLIDE 47

Hessian Matrix

  • Second-order partial derivatives
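
For reference, the standard definition for a twice-differentiable $f: \mathbb{R}^n \to \mathbb{R}$:

$H(f) = \begin{pmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{pmatrix}$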
SLIDE 48

Maxima and Minima for Univariate Function

  • If $\frac{df(x)}{dx} = 0$, it is a minima or a maxima point; we then study the second derivative:

− If $\frac{d^2 f(x)}{dx^2} < 0$ => Maxima
− If $\frac{d^2 f(x)}{dx^2} > 0$ => Minima
− If $\frac{d^2 f(x)}{dx^2} = 0$ => Point of inflection

SLIDE 49

Maxima and Minima for Multivariate Function

  • Computing the gradient and setting it to the zero vector gives us the list of stationary points
  • For a stationary point $x_0 \in \mathbb{R}^n$:

− If the Hessian matrix of the function at $x_0$ has both positive and negative eigenvalues, then $x_0$ is a saddle point
− If the eigenvalues of the Hessian matrix are all positive, then the stationary point is a local minimum
− If the eigenvalues are all negative, then the stationary point is a local maximum

SLIDE 50

Chain Rule

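
The slide's figure is not reproduced; the standard statement for a composite function $f(g(x))$ is:

$\dfrac{d}{dx} f\big(g(x)\big) = f'\big(g(x)\big)\, g'(x)$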
SLIDE 51

Symbolic Differentiation

  • Computation graph:

− c = a + b
− d = b + 1
− e = c * d

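
A quick check of this graph's derivatives with SymPy (assuming the sympy package is available; not necessarily how the slide derives them):

```python
import sympy as sp

a, b = sp.symbols('a b')
c = a + b
d = b + 1
e = c * d

print(sp.expand(sp.diff(e, a)))   # b + 1        (= d)
print(sp.expand(sp.diff(e, b)))   # a + 2*b + 1  (= c + d)
```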
SLIDE 52

Stochastic Gradient Descent

  • 1. Draw a batch of training samples x and corresponding targets y
  • 2. Run the network on x to obtain predictions y_pred
  • 3. Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y
  • 4. Compute the gradient of the loss with regard to the network's parameters (a backward pass)
  • 5. Move the parameters a little in the opposite direction from the gradient: W -= step * gradient
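
A toy NumPy sketch of one such update for a linear model; the model, loss, and data are minimal stand-ins, not the network from the lecture:

```python
import numpy as np

step = 0.1
W = np.zeros(3)                               # parameters of a toy linear model

x = np.random.rand(8, 3)                      # 1. a batch of 8 samples
y = np.random.rand(8)                         #    and corresponding targets

y_pred = x @ W                                # 2. forward pass
loss = np.mean((y_pred - y) ** 2)             # 3. mean squared error on the batch
gradient = 2 * x.T @ (y_pred - y) / len(y)    # 4. gradient of the loss w.r.t. W
W -= step * gradient                          # 5. move against the gradient
```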

SLIDE 53

Gradient Descent along a 2D Surface

SLIDE 54

Avoid Local Minimum using Momentum
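
A runnable toy sketch of the momentum update rule on a simple quadratic loss (my own example; the lecture's figure presumably shows a loss surface with local minima):

```python
def grad(w):                 # gradient of a toy loss f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w, velocity = 0.0, 0.0
momentum, lr = 0.9, 0.1

for _ in range(200):
    velocity = momentum * velocity - lr * grad(w)   # accumulate past gradients
    w += velocity                                   # momentum carries the update past shallow bumps
print(w)   # ~3.0, the minimum of the toy loss
```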

SLIDE 55

Basics of Probability

SLIDE 56

Three Axioms of Probability

  • Given an event $E$ in a sample space $S$, $S = \bigcup_{i=1}^{N} E_i$
  • First axiom

− $P(E) \in \mathbb{R},\ 0 \le P(E) \le 1$

  • Second axiom

− $P(S) = 1$

  • Third axiom

− Additivity: for any countable sequence of mutually exclusive events $E_i$,
− $P\left(\bigcup_{i=1}^{n} E_i\right) = P(E_1) + P(E_2) + \dots + P(E_n) = \sum_{i=1}^{n} P(E_i)$

SLIDE 57

Union, Intersection, and Conditional Probability

  • $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
  • $P(A \cap B)$ is simplified as $P(AB)$
  • Conditional probability $P(A|B)$: the probability of event A given B has occurred

− $P(A|B) = \dfrac{P(AB)}{P(B)}$
− $P(AB) = P(A|B)\,P(B) = P(B|A)\,P(A)$

SLIDE 58

Chain Rule of Probability

  • The joint probability can be expressed with the chain rule
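
For example, for three events (longer chains follow the same pattern):

$P(A, B, C) = P(A)\, P(B|A)\, P(C|A, B)$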
SLIDE 59

Mutually Exclusive

  • $P(AB) = 0$
  • $P(A \cup B) = P(A) + P(B)$
SLIDE 60

Independence of Events

  • Two events A and B are said to be independent if the probability of

their intersection is equal to the product of their individual probabilities

− $P(AB) = P(A)\,P(B)$
− $P(A|B) = P(A)$

SLIDE 61

Bayes Rule

  • $P(A|B) = \dfrac{P(B|A)\,P(A)}{P(B)}$
  • Proof:

− Remember $P(A|B) = \dfrac{P(AB)}{P(B)}$
− So $P(AB) = P(A|B)\,P(B) = P(B|A)\,P(A)$
− Then Bayes: $P(A|B) = P(B|A)\,P(A) / P(B)$

SLIDE 62

Probability Mass Function and Density Function

  • Probability mass function (PMF)

− A function that gives the probability that a discrete random variable is exactly equal to some value
− Example (fair die): $P(X = i) = \frac{1}{6},\ i \in \{1,2,3,4,5,6\}$

  • Probability density function (PDF)

− Specifies the probability of the random variable falling within a particular range
− The density integrates to one over its domain $D$: $\int_{D} p(x)\,dx = 1$

SLIDE 63

Expectation of a Random Variable

  • Expectation of a discrete random variable

$E[X] = x_1 p_1 + x_2 p_2 + \dots + x_n p_n = \sum_{i=1}^{n} x_i p_i$

  • Expectation of a continuous random variable

$E[X] = \int_{D} x\, p(x)\, dx$

SLIDE 64

Variance of a Random Variable

  • Variance of a discrete random variable

$\mathrm{Var}(X) = E\big[(X - \mu)^2\big], \text{ where } \mu = E[X]$

  • Variance of a continuous random variable

$\mathrm{Var}(X) = \int_{D} (x - \mu)^2\, p(x)\, dx$

  • Standard deviation $\sigma$ is the square root of the variance

SLIDE 65

Covariance and Correlation Coefficient

  • Covariance of two random variables

$\mathrm{Cov}(X, Y) = E\big[(X - \mu_x)(Y - \mu_y)\big], \text{ where } \mu_x = E[X],\ \mu_y = E[Y]$

  • Correlation coefficient

$\rho = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y}$

SLIDE 66

Normal (Gaussian) Distribution

  • One of the most important distributions
  • Central limit theorem

− Averages of samples of observations of random variables independently drawn from independent distributions converge to the normal distribution
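
For reference, the density of the normal distribution with mean $\mu$ and standard deviation $\sigma$ is:

$f(x) = \dfrac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$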

SLIDE 67

Optimization

https://en.wikipedia.org/wiki/Optimization_problem

SLIDE 68

Formulate Your Problem

  • Linear model: $f(\boldsymbol{x}) = \boldsymbol{w}^T \boldsymbol{x} + b$
  • Least-squared error: $(f(\boldsymbol{x}) - y)^2$
  • Regularization: $\lVert \boldsymbol{w} \rVert$
  • Objective function:

$\min_{\boldsymbol{w}}\ \lVert \boldsymbol{w}^T \boldsymbol{x} - y \rVert^2 + \lambda \lVert \boldsymbol{w} \rVert$
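
A hedged sketch of solving this kind of objective in closed form with an $\ell_2$ penalty (ridge regression); the data and $\lambda$ are made up, and the slide's exact regularizer may differ:

```python
import numpy as np

np.random.seed(0)
X = np.random.rand(100, 3)                     # 100 samples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * np.random.randn(100)    # noisy linear targets

lam = 0.1                                      # regularization strength (lambda)
# ridge closed form: w = (X^T X + lambda*I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(w)                                       # close to true_w
```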

SLIDE 69

References

  • Francois Chollet, "Deep Learning with Python," Chapter 2 "Mathematical Building Blocks of Neural Networks"

  • Santanu Pattanayak, ”Pro Deep Learning with TensorFlow,” Apress, 2017
  • Machine Learning Cheat Sheet
  • https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/
  • https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization-How-does-it-solve-the-problem-of-overfitting-Which-regularizer-to-use-and-when

  • Wikipedia