
SLIDE 1

Deep Learning Basics

Rachel Hu and Zhi Zhang

Amazon AI

SLIDE 2

Outline

  • Installations
  • Deep Learning Motivations
  • DeepNumpy & Calculus
  • Regression
  • Optimization
  • Softmax Regression
  • Multilayer Perceptron (train MNIST)

SLIDE 3

Installations

SLIDE 4

Installations

  • Python
      • Everyone is using it in machine learning
  • Miniconda
      • Package manager (for simplicity)
  • Jupyter Notebook
      • So much easier to keep track of your experiments

SLIDE 5

Installations

Detailed step-by-step instructions for a local install (Mac or Linux):

https://d2l.ai/chapter_install/install.html

SLIDE 6

Deep Learning

SLIDE 7

Classify Images

http://www.image-net.org/

SLIDE 8

Classify Images

http://www.image-net.org/

Yanofsky, Quartz: https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/

SLIDE 9

Detect and Segment Objects

https://github.com/matterport/Mask_RCNN

SLIDE 10

Style Transfer

https://github.com/zhanghang1989/MXNet-Gluon-Style-Transfer/

SLIDE 11

Synthesize Faces

Karras et al, arXiv 2019

SLIDE 12

Analogies

https://nlp.stanford.edu/projects/glove/

SLIDE 13

Machine Translation

https://www.pcmag.com/news/349610/google-expands-neural-networks-for-language-translation

SLIDE 14

Text Synthesis

Li et al, NAACL, 2018

SLIDE 15

Question answering

Shi et al, 2018, arXiv

Q: “What’s her mustache made of?”
A: “Banana” (question type: “Subordinate Object Recognition”)

[Architecture diagram: a vision feature extractor and a text feature extractor are combined into a predictor, with question-type guided attention]

SLIDE 16

Image captioning

Shallue et al, 2016: https://ai.googleblog.com/2016/09/show-and-tell-image-captioning-open.html

SLIDE 17

Problems we will solve: Classification

Given an image x, estimate its label y (e.g. cat, dog, rabbit, gerbil):

  y = f(x), where y ∈ {1, …, N}

SLIDE 18

Problems we will solve: Regression

Given an image x, estimate a continuous label y (e.g. 0.4 kg, 2 kg, 4 kg, 10 kg):

  y = f(x), where y ∈ ℝ

SLIDE 19

Problems we will solve today: Sequence Models

GPT-2, 2019

SLIDE 20

DeepNumpy & Calculus

SLIDE 21

N-dimensional Arrays

N-dimensional arrays are the main data structure for machine learning and neural networks.

  • 0-d (scalar), e.g. a class label: 1.0
  • 1-d (vector), e.g. a feature vector: [1.0, 2.7, 3.4]
  • 2-d (matrix), e.g. an example-by-feature matrix: [[1.0, 2.7, 3.4], [5.0, 0.2, 4.6], [4.3, 8.5, 0.2]]

SLIDE 22

N-dimensional Arrays

  • 3-d, e.g. an RGB image (width × height × channels): [[[0.1, 2.7, 3.4], [5.0, 0.2, 4.6], [4.3, 8.5, 0.2]], [[3.2, 5.7, 3.4], [5.4, 6.2, 3.2], [4.1, 3.5, 6.2]]]
  • 4-d, e.g. a batch of RGB images (batch-size × width × height × channels)
  • 5-d, e.g. a batch of videos (batch-size × time × width × height × channels)
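
As a concrete sketch in plain NumPy (the shapes below are illustrative assumptions, not from the slides):

```python
import numpy as np

scalar = np.array(1.0)                  # 0-d: a class label
vector = np.array([1.0, 2.7, 3.4])      # 1-d: a feature vector
matrix = np.arange(9.0).reshape(3, 3)   # 2-d: example-by-feature matrix
image  = np.zeros((32, 32, 3))          # 3-d: width x height x channels
batch  = np.zeros((16, 32, 32, 3))      # 4-d: batch of RGB images
videos = np.zeros((16, 8, 32, 32, 3))   # 5-d: batch of videos

for a in (scalar, vector, matrix, image, batch, videos):
    print(a.ndim, a.shape)
```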

SLIDE 23

Element-wise access

Given a 4×4 matrix with entries 1…16 (rows and columns indexed 0…3):

  • element: [1, 2]
  • row: [1, :]
  • column: [:, 1]
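
A minimal NumPy sketch of these accesses:

```python
import numpy as np

# The 4x4 matrix from the slide, entries 1..16, rows/columns indexed 0..3
A = np.arange(1, 17).reshape(4, 4)

print(A[1, 2])     # element [1, 2] -> 7
print(A[1, :])     # row 1 -> [5 6 7 8]
print(A[:, 1])     # column 1 -> [2 6 10 14]
print(A[1:3, 1:])  # a sub-matrix via slicing
```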

SLIDE 25

Calculus - Derivatives

The derivative measures the sensitivity of the output value to a change in the input value, e.g. the slope of the tangent:

  d/dx x² = 2x, which is 2 at x = 1

Common rules:

  y                     dy/dx
  a                     0
  xⁿ                    n xⁿ⁻¹
  exp(x)                exp(x)
  log(x)                1/x
  sin(x)                cos(x)
  u + v                 du/dx + dv/dx
  u v                   (du/dx) v + (dv/dx) u
  y = f(u), u = g(x)    (dy/du)(du/dx)
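
A quick numerical sanity check of that example (a finite-difference sketch, not from the slides):

```python
# Verify d/dx x^2 = 2x at x = 1 with a finite difference
f = lambda x: x ** 2
x, h = 1.0, 1e-6
numeric = (f(x + h) - f(x)) / h
print(numeric, 2 * x)  # ~2.000001 vs. the exact value 2.0
```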

SLIDE 26

Calculus - Non-differentiable

Extend the derivative to non-differentiable cases. E.g. y = |x| has slope −1 for x < 0 and slope +1 for x > 0, so at x = 0:

  ∂|x|/∂x = 1 if x > 0; −1 if x < 0; a with a ∈ [−1, 1] if x = 0

Another example:

  ∂max(x, 0)/∂x = 1 if x > 0; 0 if x < 0; a with a ∈ [0, 1] if x = 0

SLIDE 27

Calculus - Gradients

The gradient is a multi-variable generalization of the derivative. Shapes of ∂y/∂x:

                       x scalar    x vector, x ∈ ℝⁿ
  y scalar             (1,)        (1, n)
  y vector, y ∈ ℝᵐ     (m, 1)      (m, n)

SLIDE 28

Derivatives for Vectors

For x = [x₁, x₂, …, xₙ]ᵀ ∈ ℝⁿ and y = [y₁, y₂, …, yₘ]ᵀ ∈ ℝᵐ, the derivative ∂y/∂x ∈ ℝᵐˣⁿ is the matrix with entries

  [∂y/∂x]ᵢⱼ = ∂yᵢ/∂xⱼ

i.e. row i is ∂yᵢ/∂x:

  ∂y/∂x = [[∂y₁/∂x₁, ∂y₁/∂x₂, …, ∂y₁/∂xₙ],
           [∂y₂/∂x₁, ∂y₂/∂x₂, …, ∂y₂/∂xₙ],
           …,
           [∂yₘ/∂x₁, ∂yₘ/∂x₂, …, ∂yₘ/∂xₙ]]

SLIDE 29

Derivatives for vectors

For x ∈ ℝⁿ and a scalar y ∈ ℝ, ∂y/∂x ∈ ℝ¹ˣⁿ. E.g. (0 and 1 are vectors; a is not a function of x):

  y          ∂y/∂x
  a          0ᵀ
  au         a ∂u/∂x
  u + v      ∂u/∂x + ∂v/∂x
  uv         (∂u/∂x) v + (∂v/∂x) u
  ⟨u, v⟩     uᵀ ∂v/∂x + vᵀ ∂u/∂x
  sum(x)     1ᵀ
  ∥x∥²       2xᵀ

Let’s do some exercises!

SLIDE 30

Derivatives for vectors

For x ∈ ℝⁿ and y ∈ ℝᵐ, ∂y/∂x ∈ ℝᵐˣⁿ. E.g. (0 and I are matrices; a, a and A are not functions of x):

  y         ∂y/∂x
  a         0
  x         I
  Ax        A
  xᵀA       Aᵀ
  au        a ∂u/∂x
  Au        A ∂u/∂x
  u + v     ∂u/∂x + ∂v/∂x

Let’s do some exercises!
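
A small numerical check of one table row (a finite-difference sketch; the matrix sizes are arbitrary assumptions):

```python
import numpy as np

# Verify that for y = Ax, the Jacobian dy/dx equals A
rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
x = rng.normal(size=3)
h = 1e-6

J = np.empty((2, 3))
for j in range(3):
    e = np.zeros(3)
    e[j] = h
    J[:, j] = (A @ (x + e) - A @ x) / h  # finite-difference column j

print(np.allclose(J, A, atol=1e-4))  # True
```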

SLIDE 31

Generalize to Matrices

Shapes of the derivative when the input is a scalar x, a vector x ∈ ℝⁿ (shape (n, 1)), or a matrix X (shape (n, k)):

                     x (1,)    x (n, 1)     X (n, k)
  y scalar (1,)      (1,)      (1, n)       (k, n)
  y vector (m, 1)    (m, 1)    (m, n)       (m, k, n)
  Y matrix (m, l)    (m, l)    (m, l, n)    (m, l, k, n)

SLIDE 32

Chain Rule

Scalars: for y = f(u), u = g(x),

  dy/dx = (dy/du)(du/dx)

Vectors: ∂y/∂x = (∂y/∂u)(∂u/∂x), with shapes

  (1, n) = (1,) × (1, n)
  (1, n) = (1, k) × (k, n)
  (m, n) = (m, k) × (k, n)

Too many shapes to memorize …

SLIDE 33

Automatic Differentiation

Computing derivatives by hand is HARD. Chain rule (evaluated e.g. via backprop):

  ∂y/∂x = (∂y/∂uₙ)(∂uₙ/∂uₙ₋₁) … (∂u₂/∂u₁)(∂u₁/∂x)

Compute graph:

  • Built explicitly (TensorFlow, MXNet Symbol)
  • Built implicitly by tracing (Chainer, PyTorch, DeepNumpy)

SLIDE 34

Automatic Differentiation

Example: z = (⟨x, w⟩ − y)²

Compute graph, built from the inputs x, w, y:

  1. a = ⟨x, w⟩
  2. b = a − y
  3. z = b²

Backprop then evaluates ∂z/∂w by applying the chain rule through the graph.
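
A minimal sketch of this example with MXNet’s DeepNumpy interface and implicit tracing (assumes mxnet is installed):

```python
from mxnet import autograd, np, npx
npx.set_np()

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.1, 0.2, 0.3])
y = np.array(2.0)

w.attach_grad()              # allocate storage for dz/dw
with autograd.record():      # build the compute graph by tracing
    a = np.dot(x, w)         # a = <x, w>
    b = a - y                # b = a - y
    z = b * b                # z = b^2
z.backward()                 # backprop through the traced graph

print(w.grad)                # dz/dw = 2 * (<x, w> - y) * x
```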

SLIDE 35

NumPy & AutoGrad notebook

SLIDE 36

Regression

SLIDE 37

Can we estimate prices from (time, server, region)?

  g3.4xlarge: $0.41    g3.8xlarge: $0.73    g3.16xlarge: $1.37

  p = w_time · t + w_server · s + w_region · r

SLIDE 38

Linear Model

  • Basic version:

      ŷ = w₁x₁ + w₂x₂ + … + wₙxₙ + b

  • Vectorized version, with n-dimensional input x = [x₁, x₂, …, xₙ]ᵀ, weights w = [w₁, w₂, …, wₙ]ᵀ, and bias b:

      ŷ = ⟨w, x⟩ + b

  • Vectorized version (closed form): fold the bias into the weights as an extra element, X ← [X, 1] and w ← [w; b], giving

      ŷ = ⟨w, x⟩

SLIDE 39

Loss (ℓ₂)

  • Basic version:

      ℓ(y, ŷ) = (1/n) Σᵢ (yᵢ − ŷᵢ)²

  • Vectorized version:

      ℓ(X, y, w, b) = (1/n) ∥y − Xw − b∥²

  • Vectorized version (closed form), with the bias folded into w:

      ℓ(X, y, w) = (1/n) ∥y − Xw∥²

SLIDE 40

Objective Function

  • The objective is to minimize the training loss:

      argmin_w ℓ(X, y, w) ⇔ argmin_w (1/n) ∥y − Xw∥²

Setting the gradient to zero yields the closed-form solution:

      ∂ℓ(X, y, w)/∂w = 0
  ⇔  (2/n) (y − Xw)ᵀ X = 0
  ⇔  w* = (XᵀX)⁻¹ Xᵀ y
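
A minimal NumPy sketch of the closed form on synthetic data (the true weights below are assumptions for illustration):

```python
import numpy as np

# Synthetic data: y = X @ w_true + b_true + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true, b_true = np.array([2.0, -3.4, 1.7]), 4.2
y = X @ w_true + b_true + 0.01 * rng.normal(size=100)

# Fold the bias into the weights: X <- [X, 1]
Xb = np.hstack([X, np.ones((X.shape[0], 1))])

# w* = (X^T X)^{-1} X^T y, computed with solve() for numerical stability
w_star = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(w_star)  # approximately [2.0, -3.4, 1.7, 4.2]
```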

SLIDE 41

Linear Model as a Single-layer Neural Network

We can stack multiple layers to get deep neural networks

SLIDE 42

Linear Regression notebook

SLIDE 43

Optimization

[Figure: optimization trajectories comparing the negative gradient direction with momentum]

SLIDE 44

Gradient Descent in 1D

Consider a continuously differentiable real-valued function f: ℝ → ℝ. A Taylor expansion gives:

  f(x + ε) = f(x) + ε f′(x) + O(ε²)

Pick a fixed step size η > 0 and choose ε = −η f′(x):

  f(x − η f′(x)) = f(x) − η f′(x)² + O(η² f′(x)²)

SLIDE 45

Gradient Descent in 1D

If the derivative does not vanish, we make progress, since η f′(x)² > 0 when f′(x) ≠ 0. Moreover, we can always choose η small enough for the higher-order terms to become irrelevant. Hence we arrive at:

  f(x − η f′(x)) ⪅ f(x)

This means that if we iterate with

  x ← x − η f′(x)

the value of the function f(x) might decline.
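
A minimal sketch of this 1D update on an assumed example f(x) = x², where f′(x) = 2x:

```python
# Gradient descent in 1D: x <- x - eta * f'(x)
f_prime = lambda x: 2 * x   # derivative of f(x) = x^2

x, eta = 10.0, 0.2
for t in range(10):
    x -= eta * f_prime(x)
    print(f"step {t}: x = {x:.6f}")
# x shrinks geometrically toward the minimizer x* = 0
```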

SLIDE 46

Gradient Descent in General

Choose a starting point w₀, then repeatedly update the weights:

  wₜ = wₜ₋₁ − η ∂ℓ(wₜ₋₁)/∂w

  • The gradient points in the direction of increasing value, so we step against it
  • The learning rate η adjusts the step length

SLIDE 47

Goldilocks Learning Rate

  • Too small: more iterations to converge
  • Too big: may diverge

SLIDE 48

Mini-batch Stochastic Gradient Descent (SGD)

  • Computing the gradient over all the data is too slow
      • Redundancy in the data (e.g. many similar digits)
  • A single observation is not efficient on a GPU
  • So sample b examples i₁, …, i_b to approximate the loss and gradient (see the sketch below):

      (1/b) Σ_{i∈I_b} ℓ(xᵢ, yᵢ, w)   and   (1/b) Σ_{i∈I_b} ∂ℓ(xᵢ, yᵢ, w)/∂w

  • b is the mini-batch size (chosen for GPU efficiency)
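
A minimal mini-batch SGD sketch for the squared loss (the synthetic data and hyper-parameters are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
w_true = np.array([2.0, -3.4, 1.7])
y = X @ w_true + 0.01 * rng.normal(size=1000)

w, eta, b = np.zeros(3), 0.1, 32
for epoch in range(5):
    idx = rng.permutation(len(X))                # shuffle the examples
    for start in range(0, len(X), b):
        batch = idx[start:start + b]
        err = X[batch] @ w - y[batch]            # predictions minus targets
        grad = 2 * X[batch].T @ err / len(batch) # mini-batch gradient
        w -= eta * grad                          # SGD update
print(w)  # close to [2.0, -3.4, 1.7]
```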

SLIDE 49

Gradient Descent vs. SGD

  Method          Batch size      Computation    Memory         Pros and cons
  GD              training size   efficient      inefficient    Parallel processing available; stable descent, but may stall at saddle points
  Mini-batch GD   b               okay           okay           A compromise that injects enough noise into each gradient update while converging relatively quickly
  SGD             1               inefficient    efficient      Can escape saddle points or local minima, but may be very noisy

SLIDE 50

Softmax Regression

SLIDE 51

Regression vs. Classification

  • Regression estimates a continuous value, e.g. housing price prediction
  • Classification predicts a discrete category, e.g. MNIST: classify hand-written digits (10 classes)

SLIDE 52

From Regression to Multi-class Classification

Regression:

  • Single continuous output
  • Natural scale in ℝ
  • Loss given e.g. in terms of the difference y − f(x)

Multi-class classification:

  • Multiple outputs, one for each class
  • Loss given over multiple outputs
  • Outputs should reflect confidence

SLIDE 53

From Regression to Multi-class Classification

  • One-hot encoding per class:

      y = [y₁, y₂, …, yₙ]ᵀ with yᵢ = 1 if i = y, and 0 otherwise

  • Train with squared loss?
  • Largest output wins: ŷ = argmaxᵢ oᵢ
  • But max is not differentiable

Example: given y = [0, 0, 1] and the prediction ŷ = [0.3, 0, 7000], the losses are not on the same scale!

SLIDE 54

From Regression to Multi-class Classification

Softmax Regression: probability indicates confidence (nonnegative, sums to 1):

  softmax([o₁, o₂, …, oₙ]ᵀ) = [exp(o₁)/Σᵢ exp(oᵢ), exp(o₂)/Σᵢ exp(oᵢ), …, exp(oₙ)/Σᵢ exp(oᵢ)]

Example: given scores [1, −1, 2], softmax gives [0.26, 0.04, 0.70].
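
A minimal, numerically stable softmax sketch reproducing that example:

```python
import numpy as np

def softmax(o):
    o = o - o.max()        # subtract the max for numerical stability
    e = np.exp(o)
    return e / e.sum()

print(softmax(np.array([1.0, -1.0, 2.0])))  # ~[0.26, 0.04, 0.70]
```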

SLIDE 55

Cross-Entropy Loss

  • (element-wise) Negative log-likelihood, for a given class label y:

      −log p(y | o) = log Σᵢ exp(oᵢ) − o_y

  • (vector-wise) Cross-entropy loss, for a probability distribution y with yᵢ ∈ [0, 1]:

      ℓ(y, o) = log Σᵢ exp(oᵢ) − yᵀo

  • Gradient: the difference between the true and the estimated probabilities (see the sketch below):

      ∂ℓ(y, o)/∂o = exp(o)/Σᵢ exp(oᵢ) − y = softmax(o) − y
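
A minimal sketch of the loss and its gradient via a stable log-sum-exp (function and variable names are assumptions):

```python
import numpy as np

def cross_entropy(o, y):
    """o: scores (logits); y: one-hot label (or any distribution)."""
    m = o.max()
    lse = m + np.log(np.exp(o - m).sum())  # log sum_i exp(o_i), stable
    return lse - y @ o                     # l(y, o) = logsumexp(o) - y^T o

def grad_cross_entropy(o, y):
    e = np.exp(o - o.max())
    return e / e.sum() - y                 # softmax(o) - y

o = np.array([1.0, -1.0, 2.0])
y = np.array([0.0, 0.0, 1.0])
print(cross_entropy(o, y), grad_cross_entropy(o, y))
```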

SLIDE 56

Regression vs. Softmax Regression

  Problem      Regression (linear model)    Softmax regression (k-class classification)
  Model        ⟨w, x⟩ + b                   softmax(Wx + b)
  Parameters   w ∈ ℝⁿ, b ∈ ℝ                W ∈ ℝᵏˣⁿ, b ∈ ℝᵏ
  Loss         squared loss                 cross-entropy loss

SLIDE 57

Softmax Regression Notebook

SLIDE 58

Multilayer Perceptron

SLIDE 59

Neural Networks Derive from Neuroscience

[Figure: a real neuron: inputs → computation happens here → output sent to the next layer]

SLIDE 60

Linear Model as a Single-layer Neural Network

SLIDE 61

Single Hidden Layer

We can stack multiple neurons in one layer.

SLIDE 62

Single Hidden Layer

We can stack multiple layers to get deep neural networks.

SLIDE 63

Single Hidden Layer

  • Input: x ∈ ℝⁿ
  • Hidden layer: W₁ ∈ ℝᵐˣⁿ, b₁ ∈ ℝᵐ
  • Output layer: w₂ ∈ ℝᵐ, b₂ ∈ ℝ

  h = σ(W₁x + b₁)
  o = w₂ᵀh + b₂

σ is an element-wise activation function.

SLIDE 64

Single Hidden Layer

  h = σ(W₁x + b₁)
  o = w₂ᵀh + b₂

σ is an element-wise activation function.

Why do we need a nonlinear activation?

SLIDE 65

Single Hidden Layer

With a linear hidden layer,

  h = W₁x + b₁
  o = w₂ᵀh + b₂

hence o = w₂ᵀW₁x + b′: the whole model collapses to a linear one. That is why we need a nonlinear activation.

SLIDE 66

ReLU Activation

ReLU: rectified linear unit

ReLU(x) = max(x,0)

SLIDE 67

Sigmoid Activation

Maps the input into (0, 1), a soft version of the step function σ(x) = 1 if x > 0, and 0 otherwise:

  sigmoid(x) = 1 / (1 + exp(−x))

SLIDE 68

Tanh Activation

Maps inputs into (−1, 1):

  tanh(x) = (1 − exp(−2x)) / (1 + exp(−2x))
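
Minimal NumPy sketches of the three activations from these slides:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return (1 - np.exp(-2 * x)) / (1 + np.exp(-2 * x))  # equivalent to np.tanh

x = np.linspace(-3, 3, 7)
print(relu(x), sigmoid(x), tanh(x), sep="\n")
```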

SLIDE 69

Multiclass Classification

y1, y2, …, yk = softmax(o1, o2, …, ok)

SLIDE 70

Multiple Hidden Layers

  h₁ = σ(W₁x + b₁)
  h₂ = σ(W₂h₁ + b₂)
  h₃ = σ(W₃h₂ + b₃)
  o  = W₄h₃ + b₄

Hyper-parameters (see the forward-pass sketch below):

  • number of hidden layers
  • hidden size of each layer
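
A minimal forward-pass sketch of this MLP (the layer sizes and random weights are illustrative assumptions; in training they would be learned):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

rng = np.random.default_rng(0)
sizes = [784, 256, 128, 64, 10]   # input, three hidden layers, output
params = [(rng.normal(scale=0.01, size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

def mlp(x):
    h = x
    for W, b in params[:-1]:
        h = relu(W @ h + b)       # h_k = sigma(W_k h_{k-1} + b_k)
    W, b = params[-1]
    return W @ h + b              # o = W_4 h_3 + b_4

x = rng.normal(size=784)          # e.g. a flattened 28x28 MNIST image
print(mlp(x).shape)               # (10,) class scores
```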

SLIDE 71

MLP Notebook

SLIDE 72

Summary

  • Installations
  • Deep Learning Motivations
  • DeepNumpy & Calculus
  • Regression
  • Optimization
  • Softmax Regression
  • Multilayer Perceptron (train MNIST)

SLIDE 73

Questions?