SLIDE 1

Applied Math for Deep Learning

  • Prof. Kuan-Ting Lai

2020/3/10

SLIDE 2

Applied Math for Deep Learning

  • Linear Algebra
  • Probability
  • Calculus
  • Optimization
SLIDE 3

Linear Algebra

  • Scalar

− real numbers

  • Vector (1D)

− Has a magnitude & a direction

  • Matrix (2D)

− An array of numbers arranged in rows & columns

  • Tensor (>=3D)

− Multi-dimensional arrays of numbers

SLIDE 4

Real-world examples of Data Tensors

  • Timeseries Data – 3D (samples, timesteps, features)
  • Images – 4D (samples, height, width, channels)
  • Video – 5D (samples, frames, height, width, channels)

SLIDE 5

The Matrix

SLIDE 6

Matrix

  • Define a matrix with $m$ rows and $n$ columns: $A \in \mathbb{R}^{m \times n}$

Santanu Pattanayak, ”Pro Deep Learning with TensorFlow,” Apress, 2017

SLIDE 7

Matrix Operations

  • Addition and Subtraction
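
A minimal NumPy sketch of these two element-wise operations (the matrices below are made up for illustration):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[6, 5, 4],
              [3, 2, 1]])

print(A + B)   # element-wise addition
print(A - B)   # element-wise subtraction
```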
SLIDE 8

Matrix Multiplication

  • Two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$
  • The number of columns of A must equal the number of rows of B, i.e. $n = p$
  • $A \cdot B = C$, where $C \in \mathbb{R}^{m \times q}$
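
A quick NumPy check of the shape rule above, with made-up matrices:

```python
import numpy as np

A = np.array([[1, 2],          # A is 2x2  (m=2, n=2)
              [3, 4]])
B = np.array([[5, 6, 7],       # B is 2x3  (p=2, q=3)
              [8, 9, 10]])

C = A @ B                      # valid because n == p
print(C.shape)                 # (2, 3), i.e. m x q
```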

SLIDE 9

Example of Matrix Multiplication (3-1)

https://www.mathsisfun.com/algebra/matrix-multiplying.html

SLIDE 10

Example of Matrix Multiplication (3-2)

https://www.mathsisfun.com/algebra/matrix-multiplying.html

SLIDE 11

Example of Matrix Multiplication (3-3)

https://www.mathsisfun.com/algebra/matrix-multiplying.html

SLIDE 12

Matrix Transpose

https://en.wikipedia.org/wiki/Transpose

SLIDE 13

Dot Product

  • The dot product of two vectors is a scalar
  • Notation: $v_1 \cdot v_2$ or $v_1^T v_2$
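
In NumPy (example vectors chosen for illustration):

```python
import numpy as np

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([4.0, 5.0, 6.0])

print(np.dot(v1, v2))   # 32.0
print(v1 @ v2)          # same result
```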
SLIDE 14

Dot Product of Matrix

SLIDE 15

Linear Independence

  • A vector is linearly dependent on other vectors if it can be expressed as a linear combination of those vectors
  • A set of vectors $v_1, v_2, \dots, v_n$ is linearly independent if $a_1 v_1 + a_2 v_2 + \dots + a_n v_n = 0$ implies $a_i = 0$ for all $i \in \{1, 2, \dots, n\}$

SLIDE 16

Span the Vector Space

  • $n$ linearly independent vectors can span an $n$-dimensional space

SLIDE 17

Rank of a Matrix

  • Rank is:

− The number of linearly independent row or column vectors
− The dimension of the vector space generated by its columns

  • Row rank = Column rank
  • Example:

https://en.wikipedia.org/wiki/Rank_(linear_algebra)

Row-echelon form
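
The Wikipedia example itself is not reproduced here; as a stand-in, a small matrix whose second row is a multiple of the first has rank 2:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [2, 4, 6],    # 2 x row 1, so it adds no new direction
              [0, 1, 1]])

print(np.linalg.matrix_rank(A))   # 2
```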

SLIDE 18

Identity Matrix I

  • Any vector or matrix multiplied by I remains unchanged
  • For a matrix $A_{m \times n}$: $A I_n = I_m A = A$
SLIDE 19

Inverse of a Matrix

  • The product of a square matrix $A$ and its inverse matrix $A^{-1}$ produces the identity matrix $I$
  • $A A^{-1} = A^{-1} A = I$
  • An inverse matrix is square, but not all square matrices have inverses
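
A small NumPy check with an invertible matrix chosen for illustration:

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

A_inv = np.linalg.inv(A)
print(A @ A_inv)          # approximately the 2x2 identity matrix
```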
SLIDE 20

Pseudo Inverse

  • A non-square matrix can have a left-inverse or a right-inverse matrix
  • Example: $Ax = b$, where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^{m}$

− Form the square matrix $A^T A$, so $A^T A x = A^T b$
− Multiply both sides by the inverse matrix $(A^T A)^{-1}$, giving $x = (A^T A)^{-1} A^T b$
− $(A^T A)^{-1} A^T$ is the pseudo-inverse of $A$
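
A hedged sketch of the same idea in NumPy (the over-determined system below is made up; `np.linalg.pinv` computes the pseudo-inverse directly):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])        # 3 equations, 2 unknowns
b = np.array([1.0, 2.0, 2.0])

x = np.linalg.pinv(A) @ b         # least-squares solution
# equivalent (when A has full column rank):
# x = np.linalg.inv(A.T @ A) @ A.T @ b
print(x)
```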

SLIDE 21

Norm

  • Norm is a measure of a vector’s magnitude
  • $\ell_2$ norm
  • $\ell_1$ norm
  • $\ell_p$ norm
  • $\ell_\infty$ norm
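
Computing these norms in NumPy for an example vector:

```python
import numpy as np

v = np.array([3.0, -4.0])

print(np.linalg.norm(v, 2))        # l2 norm: 5.0
print(np.linalg.norm(v, 1))        # l1 norm: 7.0
print(np.linalg.norm(v, np.inf))   # l-infinity norm: 4.0
```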
SLIDE 22

Unit norms in 2D Vectors

  • The set of all vectors of norm 1 in different 2D norms
SLIDE 23

L1 and L2 Regularization

https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization-How-does-it-solve-the-problem-of-overfitting-Which-regularizer-to-use-and-when


SLIDE 24

Eigen Vectors

  • An Eigenvector is a non-zero vector that is changed by only a scalar factor λ when a linear transformation $A$ is applied to it:
  • $x$ are Eigenvectors and $\lambda$ are Eigenvalues
  • One of the most important concepts for machine learning, ex:

− Principal Component Analysis (PCA)
− Eigenvector centrality
− PageRank
− …

$Ax = \lambda x,\quad A \in \mathbb{R}^{n \times n},\ x \in \mathbb{R}^{n}$
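
Computing eigenvalues and eigenvectors in NumPy (a simple diagonal matrix is used for illustration):

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # [2. 3.]
print(eigenvectors)   # columns are the corresponding eigenvectors
```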

SLIDE 25

Example: Shear Mapping

  • The horizontal axis is the Eigenvector

SLIDE 26

Power Iteration Method for Computing Eigenvector

1. Start with a random vector $v^{(0)}$
2. Calculate iteratively: $v^{(k+1)} = A v^{(k)}$
3. After $v^{(k)}$ converges, $v^{(k+1)} \cong v^{(k)}$
4. $v^{(k)}$ will be the Eigenvector with the largest Eigenvalue
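
A minimal sketch of power iteration; the normalization step is added here for numerical stability and does not change the direction the method converges to:

```python
import numpy as np

def power_iteration(A, num_iters=100):
    v = np.random.rand(A.shape[0])      # 1. start with a random vector
    for _ in range(num_iters):
        v = A @ v                       # 2. multiply by A
        v = v / np.linalg.norm(v)       #    normalize (added for stability)
    eigenvalue = v @ A @ v              # Rayleigh quotient of the converged vector
    return eigenvalue, v

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
print(power_iteration(A))               # dominant eigenvalue (~3.62) and its eigenvector
```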

SLIDE 27

NumPy for Linear Algebra

  • NumPy is the fundamental package for scientific computing with Python. It contains among other things:

− a powerful N-dimensional array object
− sophisticated (broadcasting) functions
− tools for integrating C/C++ and Fortran code
− useful linear algebra, Fourier transform, and random number capabilities

SLIDE 28

Python & NumPy tutorial

  • http://cs231n.github.io/python-numpy-tutorial/
  • Stanford CS231n: Convolutional Neural Networks

for Visual Recognition

− http://cs231n.stanford.edu/

SLIDE 29

Create Tensors

  • Scalars (0D tensors)
  • Vectors (1D tensors)
  • Matrices (2D tensors)
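
The slide's screenshots are not reproduced; a minimal equivalent in NumPy:

```python
import numpy as np

x0 = np.array(12)                  # scalar (0D tensor)
x1 = np.array([12, 3, 6, 14])      # vector (1D tensor)
x2 = np.array([[5, 78, 2],         # matrix (2D tensor)
               [6, 79, 3]])

print(x0.ndim, x1.ndim, x2.ndim)   # 0 1 2
```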

SLIDE 30

Create 3D Tensor
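
A minimal 3D example (the array contents are made up):

```python
import numpy as np

x3 = np.array([[[1, 2, 3], [4, 5, 6]],    # three 2x3 matrices stacked
               [[1, 2, 3], [4, 5, 6]],    # along a new first axis
               [[1, 2, 3], [4, 5, 6]]])

print(x3.ndim)    # 3
print(x3.shape)   # (3, 2, 3)
```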

SLIDE 31

Attributes of a Tensor

  • Number of axes (dimensions)

− x.ndim

  • Shape

− This is a tuple of integers showing how many elements the tensor has along each axis

  • Data type

− uint8, float32 or float64
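
Inspecting the three attributes on an example array (shaped like an MNIST-style image batch, chosen for illustration):

```python
import numpy as np

x = np.zeros((60000, 28, 28), dtype='float32')

print(x.ndim)    # 3 — number of axes
print(x.shape)   # (60000, 28, 28)
print(x.dtype)   # float32
```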

SLIDE 32

Manipulating Tensors in Numpy

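
The slide's screenshot is not reproduced; a small slicing sketch of the kind of manipulation it illustrates (the data is a random stand-in):

```python
import numpy as np

x = np.random.rand(60000, 28, 28)   # stand-in for an image dataset

batch = x[10:100]                   # select samples 10..99
print(batch.shape)                  # (90, 28, 28)

corner = x[:, :14, :14]             # top-left 14x14 pixels of every image
print(corner.shape)                 # (60000, 14, 14)
```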
SLIDE 33

Displaying the Fourth Digit

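
The slide shows a screenshot; a sketch of how the fourth digit is typically displayed, assuming the Keras MNIST dataset and matplotlib are available:

```python
from tensorflow.keras.datasets import mnist
import matplotlib.pyplot as plt

(train_images, train_labels), _ = mnist.load_data()

digit = train_images[4]               # the fourth digit (index 4)
plt.imshow(digit, cmap=plt.cm.binary)
plt.show()
```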
SLIDE 34

Real-world examples of Data Tensors

  • Vector data – 2D (samples, features)
  • Timeseries Data – 3D (samples, timesteps, features)
  • Images – 4D (samples, height, width, channels)
  • Video – 5D (samples, frames, height, width, channels)

SLIDE 35

Batch size & Epochs

  • A sample

− A sample is a single row of data

  • Batch size

− Number of samples used for one iteration of gradient descent
− Batch size = 1: stochastic gradient descent
− 1 < Batch size < all: mini-batch gradient descent
− Batch size = all: batch gradient descent

  • Epoch

− Number of times that the learning algorithm works through all training samples

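
A quick illustration of how the terms relate (the numbers are made up):

```python
import math

num_samples = 60000     # e.g. a training set the size of MNIST
batch_size = 128
epochs = 5

iterations_per_epoch = math.ceil(num_samples / batch_size)
print(iterations_per_epoch)            # 469 gradient updates per epoch
print(iterations_per_epoch * epochs)   # 2345 updates over 5 epochs
```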
SLIDE 36

Element-wise Operations for Matrix

  • Operate on each element
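
A short NumPy sketch of element-wise operations (not the slide's own code):

```python
import numpy as np

x = np.array([[1.0, -2.0], [3.0, -4.0]])
y = np.array([[5.0,  6.0], [7.0,  8.0]])

print(x + y)               # element-wise addition
print(x * y)               # element-wise (Hadamard) product
print(np.maximum(x, 0.0))  # element-wise relu
```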
SLIDE 37

NumPy Operation for Matrix

  • Leverage the Basic Linear Algebra Subprograms (BLAS)
  • BLAS is optimized using C or Fortran
SLIDE 38

Broadcasting

  • The smaller tensor is repeated along the extra axes of the larger tensor so that the shapes match
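
A small broadcasting example (shapes chosen for illustration):

```python
import numpy as np

X = np.random.rand(32, 10)   # e.g. a batch of 32 samples with 10 features
y = np.random.rand(10)       # a single vector of 10 values

Z = X + y                    # y is broadcast across the first axis of X
print(Z.shape)               # (32, 10)
```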

SLIDE 39

Tensor Dot

SLIDE 40

Implementation of Dot Product
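
The slide shows code from the book; a minimal sketch in the same spirit, a naive pure-Python dot product of two 1D vectors:

```python
import numpy as np

def naive_vector_dot(x, y):
    assert x.ndim == 1 and y.ndim == 1    # both inputs are vectors
    assert x.shape[0] == y.shape[0]       # of the same length
    z = 0.0
    for i in range(x.shape[0]):
        z += x[i] * y[i]
    return z

print(naive_vector_dot(np.array([1.0, 2.0]), np.array([3.0, 4.0])))   # 11.0
```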

SLIDE 41

Tensor Reshaping

  • Rearrange a tensor's rows and columns to match a target shape
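
Reshaping in NumPy (array contents are made up):

```python
import numpy as np

x = np.array([[0.0, 1.0],
              [2.0, 3.0],
              [4.0, 5.0]])       # shape (3, 2)

print(x.reshape((6, 1)).shape)   # (6, 1) — same 6 elements, new layout
print(x.reshape((2, 3)).shape)   # (2, 3)
```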

SLIDE 42

Matrix Transposition

  • Transposing a matrix means exchanging its rows and its columns
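
For example:

```python
import numpy as np

x = np.zeros((300, 20))
print(np.transpose(x).shape)   # (20, 300)
print(x.T.shape)               # (20, 300), shorthand for the same operation
```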
SLIDE 43

Unfolding the Manifold

  • Tensor operations are complex geometric transformations in high-dimensional space

− Dimension reduction

SLIDE 44
SLIDE 45

Differentiation


SLIDE 46

Gradient of a Function

  • Gradient is a multi-variable generalization of the derivative
  • Apply partial derivatives
  • Example
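
The slide's own example is not reproduced; as an illustration, for $f(x, y) = x^2 + y^2$:

$\nabla f(x, y) = \left( \dfrac{\partial f}{\partial x}, \dfrac{\partial f}{\partial y} \right) = (2x,\ 2y)$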
SLIDE 47

Hessian Matrix

  • Second-order partial derivatives
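
For reference, the standard definition for a twice-differentiable $f: \mathbb{R}^n \to \mathbb{R}$:

$H(f) = \begin{pmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{pmatrix}$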
SLIDE 48

Maxima and Minima for Univariate Function

  • If $\frac{df(x)}{dx} = 0$, it is a minima or a maxima point; we then study the second derivative:

− If $\frac{d^2 f(x)}{dx^2} < 0$ => Maxima
− If $\frac{d^2 f(x)}{dx^2} > 0$ => Minima
− If $\frac{d^2 f(x)}{dx^2} = 0$ => Point of inflection

SLIDE 49

Maxima and Minima for Multivariate Function

  • Computing the gradient and setting it to the zero vector gives us the list of stationary points
  • For a stationary point $x_0 \in \mathbb{R}^n$:

− If the Hessian matrix of the function at $x_0$ has both positive and negative eigenvalues, then $x_0$ is a saddle point
− If the eigenvalues of the Hessian matrix are all positive, then the stationary point is a local minimum
− If the eigenvalues are all negative, then the stationary point is a local maximum

SLIDE 50

Chain Rule

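
The slide's figure is not reproduced; the standard statement for a composite function $f(g(x))$ is:

$\dfrac{d}{dx} f\big(g(x)\big) = f'\big(g(x)\big)\, g'(x)$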
SLIDE 51

Symbolic Differentiation

  • Computation graph:

− c = a + b
− d = b + 1
− e = c * d

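
A quick check of this graph's derivatives with SymPy (assuming the sympy package is available; not necessarily how the slide derives them):

```python
import sympy as sp

a, b = sp.symbols('a b')
c = a + b
d = b + 1
e = c * d

print(sp.expand(sp.diff(e, a)))   # b + 1        (= d)
print(sp.expand(sp.diff(e, b)))   # a + 2*b + 1  (= c + d)
```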
SLIDE 52

Stochastic Gradient Descent

  • 1. Draw a batch of training samples x and corresponding targets y
  • 2. Run the network on x to obtain predictions y_pred
  • 3. Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y
  • 4. Compute the gradient of the loss with regard to the network's parameters (a backward pass)
  • 5. Move the parameters a little in the opposite direction from the gradient: W -= step * gradient
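
A toy NumPy sketch of one such update for a linear model; the model, loss, and data are minimal stand-ins, not the network from the lecture:

```python
import numpy as np

step = 0.1
W = np.zeros(3)                               # parameters of a toy linear model

x = np.random.rand(8, 3)                      # 1. a batch of 8 samples
y = np.random.rand(8)                         #    and corresponding targets

y_pred = x @ W                                # 2. forward pass
loss = np.mean((y_pred - y) ** 2)             # 3. mean squared error on the batch
gradient = 2 * x.T @ (y_pred - y) / len(y)    # 4. gradient of the loss w.r.t. W
W -= step * gradient                          # 5. move against the gradient
```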

SLIDE 53

Gradient Descent along a 2D Surface

SLIDE 54

Avoid Local Minimum using Momentum
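
A runnable toy sketch of the momentum update rule on a simple quadratic loss (my own example; the lecture's figure presumably shows a loss surface with local minima):

```python
def grad(w):                 # gradient of a toy loss f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w, velocity = 0.0, 0.0
momentum, lr = 0.9, 0.1

for _ in range(200):
    velocity = momentum * velocity - lr * grad(w)   # accumulate past gradients
    w += velocity                                   # momentum carries the update past shallow bumps
print(w)   # ~3.0, the minimum of the toy loss
```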

SLIDE 55

Basics of Probability

SLIDE 56

Three Axioms of Probability

  • Given an event $E$ in a sample space $S$, $S = \bigcup_{i=1}^{N} E_i$
  • First axiom

− $P(E) \in \mathbb{R},\ 0 \le P(E) \le 1$

  • Second axiom

− $P(S) = 1$

  • Third axiom

− Additivity: for any countable sequence of mutually exclusive events $E_i$,
− $P\left(\bigcup_{i=1}^{n} E_i\right) = P(E_1) + P(E_2) + \dots + P(E_n) = \sum_{i=1}^{n} P(E_i)$

SLIDE 57

Union, Intersection, and Conditional Probability

  • $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
  • $P(A \cap B)$ is simplified as $P(AB)$
  • Conditional probability $P(A|B)$: the probability of event A given B has occurred

− $P(A|B) = \dfrac{P(AB)}{P(B)}$
− $P(AB) = P(A|B)\,P(B) = P(B|A)\,P(A)$

SLIDE 58

Chain Rule of Probability

  • The joint probability can be expressed with the chain rule
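
For example, for three events (longer chains follow the same pattern):

$P(A, B, C) = P(A)\, P(B|A)\, P(C|A, B)$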
SLIDE 59

Mutually Exclusive

  • $P(AB) = 0$
  • $P(A \cup B) = P(A) + P(B)$
SLIDE 60

Independence of Events

  • Two events A and B are said to be independent if the probability of

their intersection is equal to the product of their individual probabilities

− $P(AB) = P(A)\,P(B)$
− $P(A|B) = P(A)$

SLIDE 61

Bayes Rule

  • $P(A|B) = \dfrac{P(B|A)\,P(A)}{P(B)}$
  • Proof:

− Remember $P(A|B) = \dfrac{P(AB)}{P(B)}$
− So $P(AB) = P(A|B)\,P(B) = P(B|A)\,P(A)$
− Then Bayes: $P(A|B) = P(B|A)\,P(A) / P(B)$

SLIDE 62

Probability Mass Function and Density Function

  • Probability mass function (PMF)

− A function that gives the probability that a discrete random variable is exactly equal to some value
− Example (fair die): $P(X = i) = \frac{1}{6},\ i \in \{1,2,3,4,5,6\}$

  • Probability density function (PDF)

− Specifies the probability of the random variable falling within a particular range
− The density integrates to one over its domain $D$: $\int_{D} p(x)\,dx = 1$

SLIDE 63

Expectation of a Random Variable

  • Expectation of a discrete random variable

$E[X] = x_1 p_1 + x_2 p_2 + \dots + x_n p_n = \sum_{i=1}^{n} x_i p_i$

  • Expectation of a continuous random variable

$E[X] = \int_{D} x\, p(x)\, dx$

SLIDE 64

Variance of a Random Variable

  • Variance of a discrete random variable

$\mathrm{Var}(X) = E\big[(X - \mu)^2\big], \text{ where } \mu = E[X]$

  • Variance of a continuous random variable

$\mathrm{Var}(X) = \int_{D} (x - \mu)^2\, p(x)\, dx$

  • Standard deviation $\sigma$ is the square root of the variance

SLIDE 65

Covariance and Correlation Coefficient

  • Covariance of two random variables

$\mathrm{Cov}(X, Y) = E\big[(X - \mu_x)(Y - \mu_y)\big], \text{ where } \mu_x = E[X],\ \mu_y = E[Y]$

  • Correlation coefficient

$\rho = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y}$

SLIDE 66

Normal (Gaussian) Distribution

  • One of the most important distributions
  • Central limit theorem

− Averages of samples of observations of random variables independently drawn from independent distributions converge to the normal distribution
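
For reference, the density of the normal distribution with mean $\mu$ and standard deviation $\sigma$ is:

$f(x) = \dfrac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$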

SLIDE 67

Optimization

https://en.wikipedia.org/wiki/Optimization_problem

SLIDE 68

Formulate Your Problem

  • Linear model: $f(\boldsymbol{x}) = \boldsymbol{w}^T \boldsymbol{x} + b$
  • Least-squared error: $(f(\boldsymbol{x}) - y)^2$
  • Regularization: $\lVert \boldsymbol{w} \rVert$
  • Objective function:

$\min_{\boldsymbol{w}}\ \lVert \boldsymbol{w}^T \boldsymbol{x} - y \rVert^2 + \lambda \lVert \boldsymbol{w} \rVert$
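
A hedged sketch of solving this kind of objective in closed form with an $\ell_2$ penalty (ridge regression); the data and $\lambda$ are made up, and the slide's exact regularizer may differ:

```python
import numpy as np

np.random.seed(0)
X = np.random.rand(100, 3)                     # 100 samples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * np.random.randn(100)    # noisy linear targets

lam = 0.1                                      # regularization strength (lambda)
# ridge closed form: w = (X^T X + lambda*I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(w)                                       # close to true_w
```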

SLIDE 69

References

  • Francois Chollet, "Deep Learning with Python," Chapter 2 "Mathematical Building Blocks of Neural Networks"

  • Santanu Pattanayak, ”Pro Deep Learning with TensorFlow,” Apress, 2017
  • Machine Learning Cheat Sheet
  • https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/
  • https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization-How-does-it-solve-the-problem-of-overfitting-Which-regularizer-to-use-and-when

  • Wikipedia