  1. Applied Math for Deep Learning Prof. Kuan-Ting Lai 2020/3/10

  2. Applied Math for Deep Learning • Linear Algebra • Probability • Calculus • Optimization

  3. Linear Algebra • Scalar − real numbers • Vector (1D) − Has a magnitude & a direction • Matrix (2D) − An array of numbers arranged in rows & columns • Tensor (>=3D) − Multi-dimensional arrays of numbers

  4. Real-world examples of Data Tensors • Timeseries Data – 3D (samples, timesteps, features) • Images – 4D (samples, height, width, channels) • Video – 5D (samples, frames, height, width, channels)

  5. The Matrix

  6. Matrix • Define a matrix A with m rows and n columns, A ∈ ℝ^(m×n) (Santanu Pattanayak, "Pro Deep Learning with TensorFlow," Apress, 2017)

  7. Matrix Operations • Addition and Subtraction

  8. Matrix Multiplication • Two matrices A ∈ ℝ^(m×n) and B ∈ ℝ^(p×q) • The number of columns of A must equal the number of rows of B, i.e. n == p • A · B = C, where C ∈ ℝ^(m×q)
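
  A minimal NumPy sketch of this shape rule (the matrices are illustrative, not from the slides): a 2×3 matrix times a 3×2 matrix gives a 2×2 result; mismatched inner dimensions would raise an error.

    import numpy as np

    A = np.array([[1, 2, 3],
                  [4, 5, 6]])       # shape (2, 3): m=2, n=3
    B = np.array([[7, 8],
                  [9, 10],
                  [11, 12]])        # shape (3, 2): p=3, q=2, so n == p
    C = A @ B                       # matrix product, shape (2, 2) = (m, q)
    print(C)
    # [[ 58  64]
    #  [139 154]]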

  9. Example of Matrix Multiplication (3-1) https://www.mathsisfun.com/algebra/matrix-multiplying.html

  10. Example of Matrix Multiplication (3-2) https://www.mathsisfun.com/algebra/matrix-multiplying.html

  11. Example of Matrix Multiplication (3-3) https://www.mathsisfun.com/algebra/matrix-multiplying.html

  12. Matrix Transpose https://en.wikipedia.org/wiki/Transpose

  13. Dot Product • The dot product of two vectors is a scalar • Notation: v₁ ∙ v₂ or v₁ᵀv₂

  14. Dot Product of Matrix

  15. Linear Independence • A vector is linearly dependent on other vectors if it can be expressed as a linear combination of them • A set of vectors v₁, v₂, ⋯, vₙ is linearly independent if a₁v₁ + a₂v₂ + ⋯ + aₙvₙ = 0 implies aᵢ = 0 for all i ∈ {1, 2, ⋯, n}

  16. Span the Vector Space • n linearly independent vectors can span an n-dimensional space

  17. Rank of a Matrix • Rank is: − The number of linearly independent row or column vectors − The dimension of the vector space generated by its columns • Row rank = column rank • Example: row echelon form https://en.wikipedia.org/wiki/Rank_(linear_algebra)
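
  A short NumPy check of the idea (the matrix is made up for illustration): a repeated row adds no new direction, so the rank is less than the number of rows.

    import numpy as np

    A = np.array([[1, 2, 3],
                  [2, 4, 6],    # 2 × row 0, so it adds no independent direction
                  [0, 1, 1]])
    print(np.linalg.matrix_rank(A))   # 2: only two linearly independent rows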

  18. Identity Matrix I • Any vector or matrix multiplied by I remains unchanged • For a matrix A ∈ ℝ^(m×n), A Iₙ = Iₘ A = A

  19. Inverse of a Matrix • The product of a square matrix A and its inverse A⁻¹ is the identity matrix I • A A⁻¹ = A⁻¹ A = I • An inverse matrix is always square, but not all square matrices have inverses
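
  A minimal NumPy sketch (example matrix chosen to be invertible): np.linalg.inv computes A⁻¹ and raises an error for a singular matrix.

    import numpy as np

    A = np.array([[4.0, 7.0],
                  [2.0, 6.0]])
    A_inv = np.linalg.inv(A)                    # LinAlgError if A were singular
    print(np.allclose(A @ A_inv, np.eye(2)))    # True: A A^-1 = I
    print(np.allclose(A_inv @ A, np.eye(2)))    # True: A^-1 A = I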

  20. Pseudo-Inverse • A non-square matrix may have a left-inverse or right-inverse • Example: A x = b, A ∈ ℝ^(m×n), b ∈ ℝ^m − Form the square matrix AᵀA: AᵀA x = Aᵀb − Multiply both sides by the inverse (AᵀA)⁻¹: x = (AᵀA)⁻¹Aᵀb − (AᵀA)⁻¹Aᵀ is the pseudo-inverse of A
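
  A small sketch of the same derivation in NumPy (the overdetermined system below is hypothetical): for a full-column-rank A, the normal-equation formula matches np.linalg.pinv.

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0]])       # A in R^(3x2): more equations than unknowns
    b = np.array([1.0, 2.0, 2.0])

    x_normal = np.linalg.inv(A.T @ A) @ A.T @ b   # (A^T A)^-1 A^T b
    x_pinv   = np.linalg.pinv(A) @ b              # NumPy's pseudo-inverse
    print(np.allclose(x_normal, x_pinv))          # True: both give the least-squares solution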

  21. Norm • A norm is a measure of a vector's magnitude • l₂ norm: ‖x‖₂ = (Σᵢ xᵢ²)^(1/2) • l₁ norm: ‖x‖₁ = Σᵢ |xᵢ| • lₚ norm: ‖x‖ₚ = (Σᵢ |xᵢ|^p)^(1/p) • l∞ norm: ‖x‖∞ = maxᵢ |xᵢ|
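
  A quick NumPy illustration with a made-up vector:

    import numpy as np

    x = np.array([3.0, -4.0])
    print(np.linalg.norm(x, 1))        # l1 norm:  |3| + |-4| = 7.0
    print(np.linalg.norm(x, 2))        # l2 norm:  sqrt(9 + 16) = 5.0
    print(np.linalg.norm(x, np.inf))   # l-infinity norm: max |xi| = 4.0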

  22. Unit norms in 2D Vectors • The set of all vectors of norm 1 in different 2D norms

  23. L1 and L2 Regularization • Minimize the training loss subject to a constraint on the weights, e.g. ‖w‖₁ ≤ t (L1) or ‖w‖₂² ≤ t (L2) https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization-How-does-it-solve-the-problem-of-overfitting-Which-regularizer-to-use-and-when

  24. Eigenvectors • An eigenvector is a non-zero vector that is only scaled by a factor λ when the linear transform A is applied to it: A x = λx, A ∈ ℝ^(n×n), x ∈ ℝ^n • The vectors x are eigenvectors and the scalars λ are eigenvalues • One of the most important concepts for machine learning, e.g.: − Principal Component Analysis (PCA) − Eigenvector centrality − PageRank − …
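
  A minimal NumPy check of the definition (example matrix is arbitrary): each column of the returned eigenvector matrix satisfies A x = λx.

    import numpy as np

    A = np.array([[2.0, 0.0],
                  [0.0, 3.0]])
    eigvals, eigvecs = np.linalg.eig(A)      # columns of eigvecs are eigenvectors
    for lam, v in zip(eigvals, eigvecs.T):
        print(np.allclose(A @ v, lam * v))   # True for each pair (lambda, x)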

  25. Example: Shear Mapping • Vectors along the horizontal axis are eigenvectors

  26. Power Iteration Method for Computing an Eigenvector 1. Start with a random vector v 2. Calculate iteratively: v^(k+1) = A v^(k) 3. After v^(k) converges, v^(k+1) ≅ v^(k) 4. v^(k) will be the eigenvector with the largest eigenvalue
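
  A minimal NumPy sketch of the iteration above (not the course's code); the per-step normalization and the example matrix are additions for numerical stability and illustration.

    import numpy as np

    def power_iteration(A, num_iters=100):
        """Approximate the dominant eigenvector/eigenvalue of A."""
        v = np.random.rand(A.shape[0])        # 1. start with a random vector
        for _ in range(num_iters):
            v = A @ v                         # 2. v^(k+1) = A v^(k)
            v = v / np.linalg.norm(v)         # keep the vector from over/underflowing
        lam = v @ A @ v                       # Rayleigh-quotient estimate of lambda
        return v, lam

    A = np.array([[2.0, 1.0],
                  [1.0, 3.0]])
    v, lam = power_iteration(A)
    print(lam)    # approx. the largest eigenvalue of A (about 3.62)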

  27. NumPy for Linear Algebra • NumPy is the fundamental package for scientific computing with Python. It contains among other things: − a powerful N-dimensional array object − sophisticated (broadcasting) functions − tools for integrating C/C++ and Fortran code − useful linear algebra, Fourier transform, and random number capabilities

  28. Python & NumPy tutorial • http://cs231n.github.io/python-numpy-tutorial/ • Stanford CS231n: Convolutional Neural Networks for Visual Recognition − http://cs231n.stanford.edu/

  29. Create Tensors • Scalars (0D tensors) • Vectors (1D tensors) • Matrices (2D tensors)

  30. Create 3D Tensor

  31. Attributes of a Tensor • Number of axes (dimensions) − x.ndim • Shape − A tuple of integers giving the tensor's size along each axis • Data type − uint8, float32, or float64
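
  A small NumPy sketch of tensor creation and these attributes (the values are arbitrary examples):

    import numpy as np

    x0 = np.array(12)                          # scalar (0D tensor)
    x1 = np.array([12, 3, 6, 14, 7])           # vector (1D tensor)
    x2 = np.array([[5, 78, 2], [6, 79, 3]])    # matrix (2D tensor)
    x3 = np.zeros((4, 2, 3))                   # 3D tensor

    print(x3.ndim)    # 3         -- number of axes
    print(x3.shape)   # (4, 2, 3) -- size along each axis
    print(x3.dtype)   # float64   -- data type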

  32. Manipulating Tensors in NumPy

  33. Displaying the Fourth Digit

  34. Real-world examples of Data Tensors • Vector data – 2D (samples, features) • Timeseries Data – 3D (samples, timesteps, features) • Images – 4D (samples, height, width, channels) • Video – 5D (samples, frames, height, width, channels)

  35. Batch size & Epochs • A sample − A sample is a single row of data • Batch size − Number of samples used for one iteration of gradient descent − Batch size = 1: stochastic gradient descent − 1 < Batch size < all: mini-batch gradient descent − Batch size = all: batch gradient descent • Epoch − Number of times that the learning algorithm work through all training samples 35

  36. Element-wise Operations for Matrix • Operate on each element

  37. NumPy Operations for Matrices • Leverage the Basic Linear Algebra Subprograms (BLAS) • BLAS implementations are optimized in C or Fortran

  38. Broadcasting • The smaller tensor is (virtually) repeated along the extra axes so that it matches the shape of the larger tensor
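
  A minimal NumPy example (arrays chosen for illustration): a (3,) vector is broadcast against a (2, 3) matrix.

    import numpy as np

    X = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])    # shape (2, 3)
    b = np.array([10.0, 20.0, 30.0])   # shape (3,)

    Y = X + b    # b is repeated along the missing axis to act like shape (2, 3)
    print(Y)
    # [[11. 22. 33.]
    #  [14. 25. 36.]]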

  39. Tensor Dot

  40. Implementation of Dot Product
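
  The slide's own code listing is not reproduced here; below is a minimal loop-based sketch of a 2D dot product, checked against np.dot. The function name and test arrays are illustrative.

    import numpy as np

    def naive_matrix_dot(a, b):
        """Dot product of two 2D tensors, written with explicit loops."""
        assert a.shape[1] == b.shape[0]           # columns of a must equal rows of b
        c = np.zeros((a.shape[0], b.shape[1]))
        for i in range(a.shape[0]):
            for j in range(b.shape[1]):
                for k in range(a.shape[1]):
                    c[i, j] += a[i, k] * b[k, j]
        return c

    a = np.random.rand(2, 3)
    b = np.random.rand(3, 4)
    print(np.allclose(naive_matrix_dot(a, b), np.dot(a, b)))   # True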

  41. Tensor Reshaping • Rearrange a tensor’s rows and columns to match a target shape

  42. Matrix Transposition • Transposing a matrix means exchanging its rows and its columns
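
  A short NumPy illustration of reshaping and transposing (values are arbitrary): reshape keeps the same elements in a new shape, and transposing swaps rows and columns.

    import numpy as np

    x = np.arange(6)             # [0 1 2 3 4 5]
    m = x.reshape((3, 2))        # same 6 elements, rearranged into 3 rows x 2 columns
    print(m.shape)               # (3, 2)
    print(m.T.shape)             # (2, 3) -- transpose swaps rows and columns
    print(np.transpose(m)[0])    # first row of the transpose = first column of m: [0 2 4]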

  43. Unfolding the Manifold • Tensor operations are complex geometric transformations in high-dimensional space − Dimension reduction

  44. Differentiation • f′(x) = lim_(h→0) [f(x + h) − f(x)] / h, also written df/dx

  45. Gradient of a Function • The gradient is the multi-variable generalization of the derivative • Apply partial derivatives: ∇f(x) = [∂f/∂x₁, ⋯, ∂f/∂xₙ]ᵀ • Example: f(x, y) = x² + y² ⇒ ∇f = (2x, 2y)

  46. Hessian Matrix • The matrix of second-order partial derivatives: Hᵢⱼ = ∂²f/∂xᵢ∂xⱼ

  47. Maxima and Minima of a Univariate Function • If df(x)/dx = 0, the point is a stationary point (a candidate maximum or minimum); we then examine the second derivative: − If d²f(x)/dx² < 0 ⇒ maximum − If d²f(x)/dx² > 0 ⇒ minimum − If d²f(x)/dx² = 0 ⇒ point of inflection
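
  A small worked example (not from the original slides): for f(x) = x³ − 3x, df/dx = 3x² − 3 = 0 gives stationary points x = ±1; the second derivative is d²f/dx² = 6x, so x = 1 (second derivative 6 > 0) is a minimum and x = −1 (second derivative −6 < 0) is a maximum.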

  48. Maxima and Minima of a Multivariate Function • Computing the gradient and setting it to the zero vector gives the list of stationary points • For a stationary point x₀ ∈ ℝ^n: − If the Hessian matrix of the function at x₀ has both positive and negative eigenvalues, then x₀ is a saddle point − If the eigenvalues of the Hessian matrix are all positive, the stationary point is a local minimum − If the eigenvalues are all negative, the stationary point is a local maximum
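
  A short NumPy sketch of the Hessian test (the function f(x, y) = x² − y² is my own example): at the stationary point (0, 0) the Hessian has eigenvalues of mixed sign, so it is a saddle point.

    import numpy as np

    # f(x, y) = x^2 - y^2: gradient (2x, -2y) vanishes at (0, 0)
    H = np.array([[ 2.0,  0.0],
                  [ 0.0, -2.0]])        # constant Hessian of f
    eigvals = np.linalg.eigvalsh(H)     # eigenvalues of the symmetric Hessian
    print(eigvals)                      # [-2.  2.] -> mixed signs => saddle point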

  49. Chain Rule • For a composite function, d f(g(x)) / dx = f′(g(x)) · g′(x)

  50. Symbolic Differentiation • Computation graph: c = a + b, d = b + 1, e = c · d
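
  A hand-worked sketch of applying the chain rule to this graph in plain Python (the input values a = 2, b = 1 are illustrative):

    a, b = 2.0, 1.0

    # forward pass through the graph
    c = a + b          # 3.0
    d = b + 1          # 2.0
    e = c * d          # 6.0

    # backward pass (chain rule)
    de_dc = d                            # e = c*d  ->  de/dc = d
    de_dd = c                            # e = c*d  ->  de/dd = c
    de_da = de_dc * 1.0                  # c = a + b  ->  dc/da = 1
    de_db = de_dc * 1.0 + de_dd * 1.0    # b reaches e through both c and d
    print(de_da, de_db)                  # 2.0 5.0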

  51. Stochastic Gradient Descent 1. Draw a batch of training samples x and corresponding targets y 2. Run the network on x to obtain predictions y_pred 3. Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y 4. Compute the gradient of the loss with regard to the network’s parameters (a backward pass). 5. Move the parameters a little in the opposite direction from the gradient: W -= step * gradient
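
  A minimal NumPy sketch of one such update, assuming a linear model y_pred = x @ W with mean-squared-error loss; the data, step size, and model are made up for illustration and are not the course's network.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 3))      # 1. a batch of 8 samples with 3 features
    y = rng.normal(size=(8, 1))      #    and their targets
    W = np.zeros((3, 1))
    step = 0.1

    y_pred = x @ W                                # 2. run the model on x
    loss = np.mean((y_pred - y) ** 2)             # 3. loss: mismatch between y_pred and y
    gradient = 2 * x.T @ (y_pred - y) / len(x)    # 4. gradient of the loss w.r.t. W
    W -= step * gradient                          # 5. move against the gradient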

  52. Gradient Descent along a 2D Surface

  53. Avoid Local Minimum using Momentum

  54. Basics of Probability

  55. Three Axioms of Probability • Given an event E in a sample space S, S = ⋃ᵢ Eᵢ • First axiom − P(E) ∈ ℝ, 0 ≤ P(E) ≤ 1 • Second axiom − P(S) = 1 • Third axiom − Additivity: for any countable sequence of mutually exclusive events Eᵢ, P(E₁ ∪ E₂ ∪ ⋯ ∪ Eₙ) = P(E₁) + P(E₂) + ⋯ + P(Eₙ) = Σᵢ P(Eᵢ)

  56. Union, Intersection, and Conditional Probability • P(A ∪ B) = P(A) + P(B) − P(A ∩ B) • P(A ∩ B) is written P(AB) for short • Conditional probability P(A|B): the probability of event A given that B has occurred − P(A|B) = P(AB) / P(B) − P(AB) = P(A|B) P(B) = P(B|A) P(A)
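
  A tiny worked example in Python (the die-roll events are my own illustration): for one fair die, let A = "even number" and B = "greater than 3".

    from fractions import Fraction

    P_A  = Fraction(3, 6)    # A = {2, 4, 6}
    P_B  = Fraction(3, 6)    # B = {4, 5, 6}
    P_AB = Fraction(2, 6)    # A and B = {4, 6}

    P_A_union_B = P_A + P_B - P_AB    # 2/3, matching {2, 4, 5, 6}
    P_A_given_B = P_AB / P_B          # 2/3: among {4, 5, 6}, two rolls are even
    print(P_A_union_B, P_A_given_B)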

  57. Chain Rule of Probability • The joint probability can be expressed with the chain rule: P(A₁A₂ ⋯ Aₙ) = P(A₁) P(A₂|A₁) P(A₃|A₁A₂) ⋯ P(Aₙ|A₁ ⋯ Aₙ₋₁)

  58. Mutually Exclusive Events • P(AB) = 0 • P(A ∪ B) = P(A) + P(B)
