
SLIDE 1

Deep Learning Basics

Rachel Hu and Zhi Zhang

Amazon AI

SLIDE 2

Outline

  • Installations
  • Deep Learning Motivations
  • DeepNumpy & Calculus
  • Regression
  • Optimization
  • Softmax Regression
  • Multilayer Perceptron (train MNIST)

SLIDE 3

Installations

SLIDE 4

Installations

  • Python
      • Everyone is using it in machine learning
  • Miniconda
      • Package manager (for simplicity)
  • Jupyter Notebook
      • So much easier to keep track of your experiments

SLIDE 5

Installations

Detailed step-by-step instructions for a local install (Mac or Linux):

https://d2l.ai/chapter_install/install.html

SLIDE 6

Deep Learning

SLIDE 7

Classify Images

http://www.image-net.org/

SLIDE 8

Classify Images

http://www.image-net.org/

Yanofsky, Quartz: https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/

SLIDE 9

Detect and Segment Objects

https://github.com/matterport/Mask_RCNN

SLIDE 10

Style Transfer

https://github.com/zhanghang1989/MXNet-Gluon-Style-Transfer/

SLIDE 11

Synthesize Faces

Karras et al, arXiv 2019

SLIDE 12

Analogies

https://nlp.stanford.edu/projects/glove/

SLIDE 13

Machine Translation

https://www.pcmag.com/news/349610/google-expands-neural-networks-for-language-translation

SLIDE 14

Text Synthesis

Li et al, NAACL, 2018

SLIDE 15

Question answering

Shi et al, 2018, arXiv

Q: “What’s her mustache made of?”
A: “Banana” (question type: “Subordinate Object Recognition”)

[Architecture diagram: a vision feature extractor and a text feature extractor are combined into a predictor, with question-type guided attention]

SLIDE 16

Image captioning

Shallue et al, 2016: https://ai.googleblog.com/2016/09/show-and-tell-image-captioning-open.html

SLIDE 17

Problems we will solve: Classification

Given an image x, estimate its label y (e.g. cat, dog, rabbit, gerbil):

  y = f(x), where y ∈ {1, …, N}

SLIDE 18

Problems we will solve: Regression

Given an image x, estimate a continuous label y (e.g. 0.4 kg, 2 kg, 4 kg, 10 kg):

  y = f(x), where y ∈ ℝ

SLIDE 19

Problems we will solve today: Sequence Models

GPT-2, 2019

SLIDE 20

DeepNumpy & Calculus

SLIDE 21

N-dimensional Arrays

N-dimensional arrays are the main data structure for machine learning and neural networks.

  • 0-d (scalar), e.g. a class label: 1.0
  • 1-d (vector), e.g. a feature vector: [1.0, 2.7, 3.4]
  • 2-d (matrix), e.g. an example-by-feature matrix: [[1.0, 2.7, 3.4], [5.0, 0.2, 4.6], [4.3, 8.5, 0.2]]

SLIDE 22

N-dimensional Arrays

  • 3-d, e.g. an RGB image (width × height × channels): [[[0.1, 2.7, 3.4], [5.0, 0.2, 4.6], [4.3, 8.5, 0.2]], [[3.2, 5.7, 3.4], [5.4, 6.2, 3.2], [4.1, 3.5, 6.2]]]
  • 4-d, e.g. a batch of RGB images (batch-size × width × height × channels)
  • 5-d, e.g. a batch of videos (batch-size × time × width × height × channels)
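
As a concrete sketch in plain NumPy (the shapes below are illustrative assumptions, not from the slides):

```python
import numpy as np

scalar = np.array(1.0)                  # 0-d: a class label
vector = np.array([1.0, 2.7, 3.4])      # 1-d: a feature vector
matrix = np.arange(9.0).reshape(3, 3)   # 2-d: example-by-feature matrix
image  = np.zeros((32, 32, 3))          # 3-d: width x height x channels
batch  = np.zeros((16, 32, 32, 3))      # 4-d: batch of RGB images
videos = np.zeros((16, 8, 32, 32, 3))   # 5-d: batch of videos

for a in (scalar, vector, matrix, image, batch, videos):
    print(a.ndim, a.shape)
```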

SLIDE 23

Element-wise access

Given a 4×4 matrix with entries 1…16 (rows and columns indexed 0…3):

  • element: [1, 2]
  • row: [1, :]
  • column: [:, 1]
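
A minimal NumPy sketch of these accesses:

```python
import numpy as np

# The 4x4 matrix from the slide, entries 1..16, rows/columns indexed 0..3
A = np.arange(1, 17).reshape(4, 4)

print(A[1, 2])     # element [1, 2] -> 7
print(A[1, :])     # row 1 -> [5 6 7 8]
print(A[:, 1])     # column 1 -> [2 6 10 14]
print(A[1:3, 1:])  # a sub-matrix via slicing
```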

SLIDE 25

Calculus - Derivatives

The derivative measures the sensitivity of the output value to a change in the input value, e.g. the slope of the tangent:

  d/dx x² = 2x, which is 2 at x = 1

Common rules:

  y                     dy/dx
  a                     0
  xⁿ                    n xⁿ⁻¹
  exp(x)                exp(x)
  log(x)                1/x
  sin(x)                cos(x)
  u + v                 du/dx + dv/dx
  u v                   (du/dx) v + (dv/dx) u
  y = f(u), u = g(x)    (dy/du)(du/dx)
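
A quick numerical sanity check of that example (a finite-difference sketch, not from the slides):

```python
# Verify d/dx x^2 = 2x at x = 1 with a finite difference
f = lambda x: x ** 2
x, h = 1.0, 1e-6
numeric = (f(x + h) - f(x)) / h
print(numeric, 2 * x)  # ~2.000001 vs. the exact value 2.0
```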

SLIDE 26

Calculus - Non-differentiable

Extend the derivative to non-differentiable cases. E.g. y = |x| has slope −1 for x < 0 and slope +1 for x > 0, so at x = 0:

  ∂|x|/∂x = 1 if x > 0; −1 if x < 0; a with a ∈ [−1, 1] if x = 0

Another example:

  ∂max(x, 0)/∂x = 1 if x > 0; 0 if x < 0; a with a ∈ [0, 1] if x = 0

SLIDE 27

Calculus - Gradients

The gradient is a multi-variable generalization of the derivative. Shapes of ∂y/∂x:

                       x scalar    x vector, x ∈ ℝⁿ
  y scalar             (1,)        (1, n)
  y vector, y ∈ ℝᵐ     (m, 1)      (m, n)

SLIDE 28

Derivatives for Vectors

For x = [x₁, x₂, …, xₙ]ᵀ ∈ ℝⁿ and y = [y₁, y₂, …, yₘ]ᵀ ∈ ℝᵐ, the derivative ∂y/∂x ∈ ℝᵐˣⁿ is the matrix with entries

  [∂y/∂x]ᵢⱼ = ∂yᵢ/∂xⱼ

i.e. row i is ∂yᵢ/∂x:

  ∂y/∂x = [[∂y₁/∂x₁, ∂y₁/∂x₂, …, ∂y₁/∂xₙ],
           [∂y₂/∂x₁, ∂y₂/∂x₂, …, ∂y₂/∂xₙ],
           …,
           [∂yₘ/∂x₁, ∂yₘ/∂x₂, …, ∂yₘ/∂xₙ]]

SLIDE 29

Derivatives for vectors

For x ∈ ℝⁿ and a scalar y ∈ ℝ, ∂y/∂x ∈ ℝ¹ˣⁿ. E.g. (0 and 1 are vectors; a is not a function of x):

  y          ∂y/∂x
  a          0ᵀ
  au         a ∂u/∂x
  u + v      ∂u/∂x + ∂v/∂x
  uv         (∂u/∂x) v + (∂v/∂x) u
  ⟨u, v⟩     uᵀ ∂v/∂x + vᵀ ∂u/∂x
  sum(x)     1ᵀ
  ∥x∥²       2xᵀ

Let’s do some exercises!

SLIDE 30

Derivatives for vectors

For x ∈ ℝⁿ and y ∈ ℝᵐ, ∂y/∂x ∈ ℝᵐˣⁿ. E.g. (0 and I are matrices; a, a and A are not functions of x):

  y         ∂y/∂x
  a         0
  x         I
  Ax        A
  xᵀA       Aᵀ
  au        a ∂u/∂x
  Au        A ∂u/∂x
  u + v     ∂u/∂x + ∂v/∂x

Let’s do some exercises!
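
A small numerical check of one table row (a finite-difference sketch; the matrix sizes are arbitrary assumptions):

```python
import numpy as np

# Verify that for y = Ax, the Jacobian dy/dx equals A
rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
x = rng.normal(size=3)
h = 1e-6

J = np.empty((2, 3))
for j in range(3):
    e = np.zeros(3)
    e[j] = h
    J[:, j] = (A @ (x + e) - A @ x) / h  # finite-difference column j

print(np.allclose(J, A, atol=1e-4))  # True
```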

SLIDE 31

Generalize to Matrices

Shapes of the derivative when the input is a scalar x, a vector x ∈ ℝⁿ (shape (n, 1)), or a matrix X (shape (n, k)):

                     x (1,)    x (n, 1)     X (n, k)
  y scalar (1,)      (1,)      (1, n)       (k, n)
  y vector (m, 1)    (m, 1)    (m, n)       (m, k, n)
  Y matrix (m, l)    (m, l)    (m, l, n)    (m, l, k, n)

SLIDE 32

Chain Rule

Scalars: for y = f(u), u = g(x),

  dy/dx = (dy/du)(du/dx)

Vectors: ∂y/∂x = (∂y/∂u)(∂u/∂x), with shapes

  (1, n) = (1,) × (1, n)
  (1, n) = (1, k) × (k, n)
  (m, n) = (m, k) × (k, n)

Too many shapes to memorize …

SLIDE 33

Automatic Differentiation

Computing derivatives by hand is HARD. Chain rule (evaluated e.g. via backprop):

  ∂y/∂x = (∂y/∂uₙ)(∂uₙ/∂uₙ₋₁) … (∂u₂/∂u₁)(∂u₁/∂x)

Compute graph:

  • Built explicitly (TensorFlow, MXNet Symbol)
  • Built implicitly by tracing (Chainer, PyTorch, DeepNumpy)

SLIDE 34

Automatic Differentiation

Example: z = (⟨x, w⟩ − y)²

Compute graph, built from the inputs x, w, y:

  1. a = ⟨x, w⟩
  2. b = a − y
  3. z = b²

Backprop then evaluates ∂z/∂w by applying the chain rule through the graph.
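
A minimal sketch of this example with MXNet’s DeepNumpy interface and implicit tracing (assumes mxnet is installed):

```python
from mxnet import autograd, np, npx
npx.set_np()

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.1, 0.2, 0.3])
y = np.array(2.0)

w.attach_grad()              # allocate storage for dz/dw
with autograd.record():      # build the compute graph by tracing
    a = np.dot(x, w)         # a = <x, w>
    b = a - y                # b = a - y
    z = b * b                # z = b^2
z.backward()                 # backprop through the traced graph

print(w.grad)                # dz/dw = 2 * (<x, w> - y) * x
```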

SLIDE 35

NumPy & AutoGrad notebook

SLIDE 36

Regression

SLIDE 37

Can we estimate prices from (time, server, region)?

  g3.4xlarge: $0.41    g3.8xlarge: $0.73    g3.16xlarge: $1.37

  p = w_time · t + w_server · s + w_region · r

SLIDE 38

Linear Model

  • Basic version:

      ŷ = w₁x₁ + w₂x₂ + … + wₙxₙ + b

  • Vectorized version, with n-dimensional input x = [x₁, x₂, …, xₙ]ᵀ, weights w = [w₁, w₂, …, wₙ]ᵀ, and bias b:

      ŷ = ⟨w, x⟩ + b

  • Vectorized version (closed form): fold the bias into the weights as an extra element, X ← [X, 1] and w ← [w; b], giving

      ŷ = ⟨w, x⟩

SLIDE 39

Loss (ℓ₂)

  • Basic version:

      ℓ(y, ŷ) = (1/n) Σᵢ (yᵢ − ŷᵢ)²

  • Vectorized version:

      ℓ(X, y, w, b) = (1/n) ∥y − Xw − b∥²

  • Vectorized version (closed form), with the bias folded into w:

      ℓ(X, y, w) = (1/n) ∥y − Xw∥²

SLIDE 40

Objective Function

  • The objective is to minimize the training loss:

      argmin_w ℓ(X, y, w) ⇔ argmin_w (1/n) ∥y − Xw∥²

Setting the gradient to zero yields the closed-form solution:

      ∂ℓ(X, y, w)/∂w = 0
  ⇔  (2/n) (y − Xw)ᵀ X = 0
  ⇔  w* = (XᵀX)⁻¹ Xᵀ y
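
A minimal NumPy sketch of the closed form on synthetic data (the true weights below are assumptions for illustration):

```python
import numpy as np

# Synthetic data: y = X @ w_true + b_true + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true, b_true = np.array([2.0, -3.4, 1.7]), 4.2
y = X @ w_true + b_true + 0.01 * rng.normal(size=100)

# Fold the bias into the weights: X <- [X, 1]
Xb = np.hstack([X, np.ones((X.shape[0], 1))])

# w* = (X^T X)^{-1} X^T y, computed with solve() for numerical stability
w_star = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(w_star)  # approximately [2.0, -3.4, 1.7, 4.2]
```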

SLIDE 41

Linear Model as a Single-layer Neural Network

We can stack multiple layers to get deep neural networks

SLIDE 42

Linear Regression notebook

SLIDE 43

Optimization

[Figure: optimization trajectories comparing the negative gradient direction with momentum]

SLIDE 44

Gradient Descent in 1D

Consider a continuously differentiable real-valued function f: ℝ → ℝ. A Taylor expansion gives:

  f(x + ε) = f(x) + ε f′(x) + O(ε²)

Pick a fixed step size η > 0 and choose ε = −η f′(x):

  f(x − η f′(x)) = f(x) − η f′(x)² + O(η² f′(x)²)

SLIDE 45

Gradient Descent in 1D

If the derivative does not vanish, we make progress, since η f′(x)² > 0 when f′(x) ≠ 0. Moreover, we can always choose η small enough for the higher-order terms to become irrelevant. Hence we arrive at:

  f(x − η f′(x)) ⪅ f(x)

This means that if we iterate with

  x ← x − η f′(x)

the value of the function f(x) might decline.
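
A minimal sketch of this 1D update on an assumed example f(x) = x², where f′(x) = 2x:

```python
# Gradient descent in 1D: x <- x - eta * f'(x)
f_prime = lambda x: 2 * x   # derivative of f(x) = x^2

x, eta = 10.0, 0.2
for t in range(10):
    x -= eta * f_prime(x)
    print(f"step {t}: x = {x:.6f}")
# x shrinks geometrically toward the minimizer x* = 0
```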

SLIDE 46

Gradient Descent in General

Choose a starting point w₀, then repeatedly update the weights:

  wₜ = wₜ₋₁ − η ∂ℓ(wₜ₋₁)/∂w

  • The gradient points in the direction of increasing value, so we step against it
  • The learning rate η adjusts the step length

SLIDE 47

Goldilocks Learning Rate

  • Too small: more iterations to converge
  • Too big: may diverge

SLIDE 48

Mini-batch Stochastic Gradient Descent (SGD)

  • Computing the gradient over all the data is too slow
      • Redundancy in the data (e.g. many similar digits)
  • A single observation is not efficient on a GPU
  • So sample b examples i₁, …, i_b to approximate the loss and gradient (see the sketch below):

      (1/b) Σ_{i∈I_b} ℓ(xᵢ, yᵢ, w)   and   (1/b) Σ_{i∈I_b} ∂ℓ(xᵢ, yᵢ, w)/∂w

  • b is the mini-batch size (chosen for GPU efficiency)
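
A minimal mini-batch SGD sketch for the squared loss (the synthetic data and hyper-parameters are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
w_true = np.array([2.0, -3.4, 1.7])
y = X @ w_true + 0.01 * rng.normal(size=1000)

w, eta, b = np.zeros(3), 0.1, 32
for epoch in range(5):
    idx = rng.permutation(len(X))                # shuffle the examples
    for start in range(0, len(X), b):
        batch = idx[start:start + b]
        err = X[batch] @ w - y[batch]            # predictions minus targets
        grad = 2 * X[batch].T @ err / len(batch) # mini-batch gradient
        w -= eta * grad                          # SGD update
print(w)  # close to [2.0, -3.4, 1.7]
```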

SLIDE 49

Gradient Descent vs. SGD

  Method          Batch size      Computation    Memory         Pros and cons
  GD              training size   efficient      inefficient    Parallel processing available; stable descent, but may stall at saddle points
  Mini-batch GD   b               okay           okay           A compromise that injects enough noise into each gradient update while converging relatively quickly
  SGD             1               inefficient    efficient      Can escape saddle points or local minima, but may be very noisy

SLIDE 50

Softmax Regression

SLIDE 51

Regression vs. Classification

  • Regression estimates a continuous value, e.g. housing price prediction
  • Classification predicts a discrete category, e.g. MNIST: classify hand-written digits (10 classes)

SLIDE 52

From Regression to Multi-class Classification

Regression:

  • Single continuous output
  • Natural scale in ℝ
  • Loss given e.g. in terms of the difference y − f(x)

Multi-class classification:

  • Multiple outputs, one for each class
  • Loss given over multiple outputs
  • Outputs should reflect confidence

SLIDE 53

From Regression to Multi-class Classification

  • One-hot encoding per class:

      y = [y₁, y₂, …, yₙ]ᵀ with yᵢ = 1 if i = y, and 0 otherwise

  • Train with squared loss?
  • Largest output wins: ŷ = argmaxᵢ oᵢ
  • But max is not differentiable

Example: given y = [0, 0, 1] and the prediction ŷ = [0.3, 0, 7000], the losses are not on the same scale!

SLIDE 54

From Regression to Multi-class Classification

Softmax Regression: probability indicates confidence (nonnegative, sums to 1):

  softmax([o₁, o₂, …, oₙ]ᵀ) = [exp(o₁)/Σᵢ exp(oᵢ), exp(o₂)/Σᵢ exp(oᵢ), …, exp(oₙ)/Σᵢ exp(oᵢ)]

Example: given scores [1, −1, 2], softmax gives [0.26, 0.04, 0.70].
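
A minimal, numerically stable softmax sketch reproducing that example:

```python
import numpy as np

def softmax(o):
    o = o - o.max()        # subtract the max for numerical stability
    e = np.exp(o)
    return e / e.sum()

print(softmax(np.array([1.0, -1.0, 2.0])))  # ~[0.26, 0.04, 0.70]
```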

SLIDE 55

Cross-Entropy Loss

  • (element-wise) Negative log-likelihood, for a given class label y:

      −log p(y | o) = log Σᵢ exp(oᵢ) − o_y

  • (vector-wise) Cross-entropy loss, for a probability distribution y with yᵢ ∈ [0, 1]:

      ℓ(y, o) = log Σᵢ exp(oᵢ) − yᵀo

  • Gradient: the difference between the true and the estimated probabilities (see the sketch below):

      ∂ℓ(y, o)/∂o = exp(o)/Σᵢ exp(oᵢ) − y = softmax(o) − y
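
A minimal sketch of the loss and its gradient via a stable log-sum-exp (function and variable names are assumptions):

```python
import numpy as np

def cross_entropy(o, y):
    """o: scores (logits); y: one-hot label (or any distribution)."""
    m = o.max()
    lse = m + np.log(np.exp(o - m).sum())  # log sum_i exp(o_i), stable
    return lse - y @ o                     # l(y, o) = logsumexp(o) - y^T o

def grad_cross_entropy(o, y):
    e = np.exp(o - o.max())
    return e / e.sum() - y                 # softmax(o) - y

o = np.array([1.0, -1.0, 2.0])
y = np.array([0.0, 0.0, 1.0])
print(cross_entropy(o, y), grad_cross_entropy(o, y))
```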

SLIDE 56

Regression vs. Softmax Regression

  Problem      Regression (linear model)    Softmax regression (k-class classification)
  Model        ⟨w, x⟩ + b                   softmax(Wx + b)
  Parameters   w ∈ ℝⁿ, b ∈ ℝ                W ∈ ℝᵏˣⁿ, b ∈ ℝᵏ
  Loss         squared loss                 cross-entropy loss

SLIDE 57

Softmax Regression Notebook

SLIDE 58

Multilayer Perceptron

SLIDE 59

Neural Networks Derive from Neuroscience

[Figure: a real neuron: inputs → computation happens here → output sent to the next layer]

SLIDE 60

Linear Model as a Single-layer Neural Network

SLIDE 61

Single Hidden Layer

We can stack multiple neurons in one layer.

SLIDE 62

Single Hidden Layer

We can stack multiple layers to get deep neural networks.

SLIDE 63

Single Hidden Layer

  • Input: x ∈ ℝⁿ
  • Hidden layer: W₁ ∈ ℝᵐˣⁿ, b₁ ∈ ℝᵐ
  • Output layer: w₂ ∈ ℝᵐ, b₂ ∈ ℝ

  h = σ(W₁x + b₁)
  o = w₂ᵀh + b₂

σ is an element-wise activation function.

SLIDE 64

Single Hidden Layer

  h = σ(W₁x + b₁)
  o = w₂ᵀh + b₂

σ is an element-wise activation function.

Why do we need a nonlinear activation?

SLIDE 65

Single Hidden Layer

With a linear hidden layer,

  h = W₁x + b₁
  o = w₂ᵀh + b₂

hence o = w₂ᵀW₁x + b′: the whole model collapses to a linear one. That is why we need a nonlinear activation.

SLIDE 66

ReLU Activation

ReLU: rectified linear unit

ReLU(x) = max(x,0)

SLIDE 67

Sigmoid Activation

Maps the input into (0, 1), a soft version of the step function σ(x) = 1 if x > 0, and 0 otherwise:

  sigmoid(x) = 1 / (1 + exp(−x))

SLIDE 68

Tanh Activation

Maps inputs into (−1, 1):

  tanh(x) = (1 − exp(−2x)) / (1 + exp(−2x))
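
Minimal NumPy sketches of the three activations from these slides:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return (1 - np.exp(-2 * x)) / (1 + np.exp(-2 * x))  # equivalent to np.tanh

x = np.linspace(-3, 3, 7)
print(relu(x), sigmoid(x), tanh(x), sep="\n")
```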

SLIDE 69

Multiclass Classification

y1, y2, …, yk = softmax(o1, o2, …, ok)

SLIDE 70

Multiple Hidden Layers

  h₁ = σ(W₁x + b₁)
  h₂ = σ(W₂h₁ + b₂)
  h₃ = σ(W₃h₂ + b₃)
  o  = W₄h₃ + b₄

Hyper-parameters (see the forward-pass sketch below):

  • number of hidden layers
  • hidden size of each layer
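
A minimal forward-pass sketch of this MLP (the layer sizes and random weights are illustrative assumptions; in training they would be learned):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

rng = np.random.default_rng(0)
sizes = [784, 256, 128, 64, 10]   # input, three hidden layers, output
params = [(rng.normal(scale=0.01, size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

def mlp(x):
    h = x
    for W, b in params[:-1]:
        h = relu(W @ h + b)       # h_k = sigma(W_k h_{k-1} + b_k)
    W, b = params[-1]
    return W @ h + b              # o = W_4 h_3 + b_4

x = rng.normal(size=784)          # e.g. a flattened 28x28 MNIST image
print(mlp(x).shape)               # (10,) class scores
```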

SLIDE 71

MLP Notebook

SLIDE 72

Summary

  • Installations
  • Deep Learning Motivations
  • DeepNumpy & Calculus
  • Regression
  • Optimization
  • Softmax Regression
  • Multilayer Perceptron (train MNIST)

SLIDE 73

Questions?