d2l.ai
Deep Learning Basics
Rachel Hu and Zhi Zhang
Amazon AI
Outline
Installations
Deep Learning Motivations
DeepNumpy & Calculus
Regression
Optimization
Softmax Regression
Multilayer Perceptron
Installations
Installations
Detailed step-by-step instructions for a local installation (Mac or Linux):
Classify Images
http://www.image-net.org/
Yanofsky, Quartz: https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/
Detect and Segment Objects
https://github.com/matterport/Mask_RCNN
Style Transfer
https://github.com/zhanghang1989/MXNet-Gluon-Style-Transfer/
Synthesize Faces
Karras et al, arXiv 2019
Analogies
https://nlp.stanford.edu/projects/glove/
Machine Translation
https://www.pcmag.com/news/349610/google-expands-neural-networks-for-language-translation
Text Synthesis
Li et al., NAACL 2018
Question Answering
Shi et al., arXiv 2018
Example: Q: "What's her mustache made of?" → A: "Banana" (question type: "Subordinate Object Recognition")
(figure: vision and text feature extractors are combined, with question-type-guided attention, into a predictor)
Image Captioning
Shallue et al., 2016, https://ai.googleblog.com/2016/09/show-and-tell-image-captioning-open.html
Problems we will solve: Classification
Given image x, estimate label y
cat dog rabbit gerbil
y = f(x) where y ∈ {1, …, N}
Problems we will solve: Regression
Given image x, estimate label y
0.4 kg, 2 kg, 4 kg, 10 kg
y = f(x) where y ∈ ℝ
Problems we will solve today: Sequence Models
GPT-2, 2019
N-dimensional Arrays
0-d (scalar): 1.0, e.g. a class label
1-d (vector): [1.0, 2.7, 3.4], a feature vector
2-d (matrix): [[1.0, 2.7, 3.4], [5.0, 0.2, 4.6], [4.3, 8.5, 0.2]], an example-by-feature matrix
N-dimensional arrays are the main data structure for machine learning and neural networks.
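A minimal NumPy sketch of these shapes (toy values from the slide):

```python
import numpy as np

scalar = np.array(1.0)                    # 0-d: e.g. a class label
vector = np.array([1.0, 2.7, 3.4])        # 1-d: a feature vector
matrix = np.array([[1.0, 2.7, 3.4],
                   [5.0, 0.2, 4.6],
                   [4.3, 8.5, 0.2]])      # 2-d: example-by-feature
print(scalar.ndim, vector.shape, matrix.shape)  # 0 (3,) (3, 3)
```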
N-dimensional Arrays
3-d (tensor): an RGB image (width × height × channels), e.g.
[[[0.1, 2.7, 3.4], [5.0, 0.2, 4.6], [4.3, 8.5, 0.2]],
 [[3.2, 5.7, 3.4], [5.4, 6.2, 3.2], [4.1, 3.5, 6.2]]]
4-d: a batch of RGB images (batch-size × width × height × channels)
5-d: a batch of videos (batch-size × time × width × height × channels)
Element-wise access
For a 4×4 matrix with entries 1 to 16 (0-indexed):
element: [1, 2]
row: [1, :]
column: [:, 1]
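In NumPy, the same accesses on the 4×4 matrix from the slide:

```python
import numpy as np

A = np.arange(1, 17).reshape(4, 4)  # the 4x4 matrix with entries 1..16
print(A[1, 2])   # element at row 1, column 2 -> 7
print(A[1, :])   # row 1 -> [5 6 7 8]
print(A[:, 1])   # column 1 -> [ 2  6 10 14]
```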
Calculus - Derivatives
y → dy/dx
a → 0
x^n → n x^(n−1)
sin(x) → cos(x)
exp(x) → exp(x)
log(x) → 1/x
u + v → du/dx + dv/dx
uv → (du/dx) v + (dv/dx) u
y = f(u), u = g(x) → (dy/du)(du/dx)
E.g. d/dx x² = 2x, the slope of the tangent; at x = 1 the slope is 2.
The derivative measures the sensitivity of the output value to a change in the input value.
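A quick finite-difference check of the example d/dx x² = 2x at x = 1:

```python
def f(x):
    return x ** 2

x, h = 1.0, 1e-6
slope = (f(x + h) - f(x)) / h   # forward difference approximates f'(x)
print(slope)  # close to 2.0
```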
Calculus - Non-differentiable
Extend the derivative to non-differentiable cases. E.g. y = |x| has slope 1 for x > 0 and slope −1 for x < 0, so:
∂|x|/∂x = 1 if x > 0; −1 if x < 0; a with a ∈ [−1, 1] if x = 0
Another example:
∂/∂x max(x, 0) = 1 if x > 0; 0 if x < 0; a with a ∈ [0, 1] if x = 0
Calculus - Gradients
The gradient is a multi-variable generalization of the derivative. Shapes of ∂y/∂x:
scalar y, scalar x: (1,)
scalar y, vector x ∈ ℝn: (1, n)
vector y ∈ ℝm, scalar x: (m, 1)
vector y ∈ ℝm, vector x ∈ ℝn: (m, n)
Derivatives for vectors
For x = [x1, x2, …, xn]ᵀ ∈ ℝn and y = [y1, y2, …, ym]ᵀ ∈ ℝm, the derivative ∂y/∂x ∈ ℝm×n stacks the row vectors ∂yi/∂x:
∂y/∂x =
[∂y1/∂x1, ∂y1/∂x2, …, ∂y1/∂xn
 ∂y2/∂x1, ∂y2/∂x2, …, ∂y2/∂xn
 ⋮
 ∂ym/∂x1, ∂ym/∂x2, …, ∂ym/∂xn]
i.e. [∂y/∂x]ij = ∂yi/∂xj.
Derivatives for vectors
For x ∈ ℝn and scalar y ∈ ℝ, ∂y/∂x ∈ ℝ1×n. Examples (0 and 1 are vectors; u and v are functions of x, a is not):
y = a → ∂y/∂x = 0ᵀ
y = au → a ∂u/∂x
y = sum(x) → 1ᵀ
y = ∥x∥² → 2xᵀ
y = u + v → ∂u/∂x + ∂v/∂x
y = uv → (∂u/∂x) v + (∂v/∂x) u
y = ⟨u, v⟩ → uᵀ ∂v/∂x + vᵀ ∂u/∂x
Let's do some exercises!
Derivatives for vectors
For x ∈ ℝn and y ∈ ℝm, ∂y/∂x ∈ ℝm×n. Examples (0 and I are matrices; a, a, and A are not functions of x):
y = a → ∂y/∂x = 0
y = x → I
y = Ax → A
y = xᵀA → Aᵀ
y = au → a ∂u/∂x
y = Au → A ∂u/∂x
y = u + v → ∂u/∂x + ∂v/∂x
Let's do some exercises!
Generalize to Matrices
Shapes of the derivative for scalar, vector, and matrix inputs and outputs:

                  x scalar     x ∈ ℝn (n,1)    X ∈ ℝn×k (n,k)
y scalar          (1,)         (1, n)          (k, n)
y ∈ ℝm (m,1)      (m, 1)       (m, n)          (m, k, n)
Y ∈ ℝm×l (m,l)    (m, l)       (m, l, n)       (m, l, k, n)
Chain Rule
Scalars: for y = f(u), u = g(x): ∂y/∂x = (∂y/∂u)(∂u/∂x)
Vectors, with shapes:
∂y/∂x (1, n) = ∂y/∂u (1,) · ∂u/∂x (1, n)
∂y/∂x (1, n) = ∂y/∂u (1, k) · ∂u/∂x (k, n)
∂y/∂x (m, n) = ∂y/∂u (m, k) · ∂u/∂x (k, n)
Too many shapes to memorize…
Automatic Differentiation
Computing derivatives by hand is HARD. Apply the chain rule (evaluated e.g. via backprop):
∂y/∂x = (∂y/∂un)(∂un/∂un−1) … (∂u2/∂u1)(∂u1/∂x)
Compute graph: built explicitly and symbolically (TensorFlow, MXNet Symbol) or recorded implicitly as the program runs (Chainer, PyTorch, DeepNumpy)
Automatic Differentiation
Example: z = (⟨x, w⟩ − y)²
Compute graph:
1. a = ⟨x, w⟩
2. b = a − y
3. z = b²
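A minimal NumPy sketch of this compute graph and its backward pass, applying the chain rule by hand (the frameworks above record the graph and do this automatically; the input values here are hypothetical):

```python
import numpy as np

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.5])
y = 1.0

# forward pass through the graph
a = np.dot(x, w)        # a = <x, w>
b = a - y               # b = a - y
z = b ** 2              # z = b^2

# backward pass (chain rule)
dz_db = 2 * b           # dz/db
db_da = 1.0             # db/da
da_dw = x               # da/dw
dz_dw = dz_db * db_da * da_dw   # dz/dw = 2(<x, w> - y) x
print(z, dz_dw)
```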
NumPy & AutoGrad notebook
Can we estimate prices from (time, server, region)?
g3.4xlarge: $0.41, g3.8xlarge: $0.73, g3.16xlarge: $1.37
p = w_time · t + w_server · s + w_region · r
Linear Model
Inputs x = [x1, x2, …, xn]ᵀ, weights w = [w1, w2, …, wn]ᵀ, bias b
ŷ = ⟨w, x⟩ + b = w1 x1 + w2 x2 + … + wn xn + b
Absorb the bias: X ← [X, 1], w ← [w; b], so that ŷ = ⟨w, x⟩
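A NumPy sketch of the bias-absorption trick (toy values):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])          # 2 examples, 2 features
w = np.array([0.1, 0.2])
b = 0.5

y_hat = X @ w + b                   # y_hat = <w, x> + b for each row

# absorb the bias: X <- [X, 1], w <- [w; b]
X1 = np.hstack([X, np.ones((2, 1))])
w1 = np.append(w, b)
print(X1 @ w1)                      # identical predictions
```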
Loss (ℓ2)
ℓ(y, ŷ) = (1/n) ∥y − ŷ∥²
ℓ(X, y, w, b) = (1/n) ∥y − Xw − b∥²
With the bias absorbed: ℓ(X, y, w) = (1/n) ∥y − Xw∥²
Objective Function
argmin_w loss ⇔ argmin_w (1/n) ∥y − Xw∥²
⇔ ∂/∂w ℓ(X, y, w) = 0
⇔ (2/n) (y − Xw)ᵀ X = 0
⇔ w* = (XᵀX)⁻¹ Xᵀ y
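A quick NumPy check of the closed-form solution w* = (XᵀX)⁻¹Xᵀy on noiseless synthetic data (a hypothetical toy setup):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true                              # noiseless targets

w_star = np.linalg.inv(X.T @ X) @ X.T @ y   # w* = (X^T X)^{-1} X^T y
print(w_star)                               # recovers w_true
```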
Linear Model as a Single-layer Neural Network
We can stack multiple layers to get deep neural networks
Linear Regression notebook
(figure: negative gradient and momentum directions)
Gradient Descent in 1D
Consider a continuously differentiable real-valued function f: ℝ → ℝ. Using a Taylor expansion we obtain:
f(x + ϵ) = f(x) + ϵ f′(x) + O(ϵ²)
Assume we pick a fixed step size η > 0 and choose ϵ = −η f′(x):
f(x − η f′(x)) = f(x) − η f′(x)² + O(η² f′(x)²)
Gradient Descent in 1D
If the derivative f′(x) ≠ 0, we make progress, since η f′(x)² > 0. Moreover, we can always choose η small enough for the higher-order terms to become irrelevant. Hence:
f(x − η f′(x)) ⪅ f(x)
This means that if we use x ← x − η f′(x) to iterate, the value of f(x) declines.
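A minimal sketch of the update x ← x − ηf′(x) on f(x) = x² (hypothetical starting point and step size):

```python
def f_prime(x):
    return 2 * x            # f(x) = x^2, so f'(x) = 2x

x, eta = 4.0, 0.1
for _ in range(100):
    x -= eta * f_prime(x)   # x <- x - eta * f'(x)
print(x)  # converges toward the minimum at 0
```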
Gradient Descent in General
Choose a starting point w0.
Repeat to update the weights: wt = wt−1 − η ∂w ℓ(wt−1), producing iterates w0, w1, w2, …
Goldilocks Learning Rate
Too small: more iterations are needed.
Too big: the iterates may diverge.
Mini-batch Stochastic Gradient Descent (SGD)
Sample a mini-batch Ib of b indices i1, …, ib to approximate the loss and gradient; b is the mini-batch size (chosen for GPU efficiency):
(1/b) Σ_{i∈Ib} ℓ(xi, yi, w) and (1/b) Σ_{i∈Ib} ∂w ℓ(xi, yi, w)
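A sketch of mini-batch SGD for linear regression with squared loss, using the gradient from the earlier derivation (the data, learning rate, and batch size here are hypothetical toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2))
w_true = np.array([1.0, -2.0])
y = X @ w_true                       # noiseless targets

w, eta, b = np.zeros(2), 0.1, 8      # b is the mini-batch size
for epoch in range(50):
    idx = rng.permutation(len(X))    # shuffle each epoch
    for start in range(0, len(X), b):
        i = idx[start:start + b]                   # mini-batch I_b
        grad = 2 * X[i].T @ (X[i] @ w - y[i]) / b  # averaged gradient
        w -= eta * grad                            # SGD update
print(w)  # approaches w_true
```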
Gradient Descent Variants
GD: batch size = training-set size; computation efficient, memory inefficient; parallel processing available; stable gradient descent, but may get stuck (e.g. at saddle points).
Mini-batch GD: batch size = b; computation okay, memory okay; a compromise that injects enough noise into each gradient update while achieving relatively speedy convergence.
SGD: batch size = 1; computation inefficient, memory efficient; can escape saddle points or local minima, but may be very noisy.
Regression vs. Classification
Regression estimates a continuous value, e.g. housing price prediction.
Classification predicts a discrete category, e.g. MNIST: classify hand-written digits (10 classes).
From Regression to Multi-class Classification
Regression: the output is a single value in ℝ and the loss depends on y − f(x).
Multi-class classification: the output is one class out of n. Encode the label as a one-hot vector y = [y1, y2, …, yn]ᵀ with yi = 1 if i = y and 0 otherwise; predict ŷ = argmax_i oi over the scores oi.

From Regression to Multi-class Classification
Use square loss? Example: given y = [0, 0, 1] and prediction ŷ = [0.3, 0, 7000], the losses are not on the same scale!
Softmax Regression
softmax([o1, o2, …, on]ᵀ) = [e^{o1}/Σ_{i=1}^n e^{oi}, e^{o2}/Σ_{i=1}^n e^{oi}, …, e^{on}/Σ_{i=1}^n e^{oi}]
(nonnegative, sums to 1)
Example: given scores [1, −1, 2], softmax gives [0.26, 0.04, 0.7].
Labels are one-hot: y = [y1, y2, …, yn]ᵀ with yi = 1 if i = y and 0 otherwise; predict ŷ = argmax_i oi.
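A NumPy softmax matching the example scores:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())   # subtract the max for numerical stability
    return e / e.sum()

p = softmax(np.array([1.0, -1.0, 2.0]))
print(p)   # nonnegative, sums to 1; largest score gets largest probability
```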
Cross-Entropy Loss
Measures the difference between the true and estimated probability distributions.
For a one-hot label (yi ∈ {0, 1}): −log p(y|o) = log Σ_i exp(oi) − oy
For soft labels (yi ∈ [0, 1]): ℓ(y, o) = log Σ_i exp(oi) − yᵀo
Gradient: ∂o ℓ(y, o) = exp(o)/Σ_i exp(oi) − y = softmax(o) − y
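A NumPy check that ℓ(y, o) = log Σᵢ exp(oᵢ) − yᵀo equals −log of the predicted probability of the true class, and that the gradient is softmax(o) − y:

```python
import numpy as np

o = np.array([1.0, -1.0, 2.0])          # scores
y = np.array([0.0, 0.0, 1.0])           # one-hot label, true class = 2

p = np.exp(o) / np.exp(o).sum()         # softmax(o)
loss = np.log(np.exp(o).sum()) - y @ o  # l(y, o) = log sum_i exp(o_i) - y^T o
grad = p - y                            # d l / d o = softmax(o) - y
print(loss, grad)
```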
Regression vs. Softmax Regression
Regression (linear model): model ŷ = ⟨w, x⟩ + b with w ∈ ℝn, b ∈ ℝ; loss: squared loss.
Softmax regression (k-class classification): model ŷ = softmax(Wx + b) with W ∈ ℝk×n, b ∈ ℝk; loss: cross-entropy loss.
Softmax Regression Notebook
Neural Networks Derive Inspiration from Neuroscience
(figure: a real neuron; inputs arrive, computation happens in the cell body, and the output is sent to the next layer)
Linear Model as a Single-layer Neural Network
Single Hidden Layer
We can stack multiple neurons in one layer.
Single Hidden Layer
We can stack multiple layers to get deep neural networks.
Single Hidden Layer
x ∈ ℝn, W1 ∈ ℝm×n, b1 ∈ ℝm, w2 ∈ ℝm, b2 ∈ ℝ
h = σ(W1 x + b1)
o = w2ᵀ h + b2
σ is an element-wise activation function
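A NumPy forward pass for a single hidden layer, with hypothetical dimensions and random weights, using ReLU as σ:

```python
import numpy as np

n, m = 4, 3                                   # input dim, hidden dim
rng = np.random.default_rng(0)
x = rng.normal(size=n)
W1, b1 = rng.normal(size=(m, n)), np.zeros(m)
w2, b2 = rng.normal(size=m), 0.0

h = np.maximum(W1 @ x + b1, 0)                # h = sigma(W1 x + b1), ReLU
o = w2 @ h + b2                               # o = w2^T h + b2
print(h.shape, o)
```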
Single Hidden Layer
h = σ(W1 x + b1)
o = w2ᵀ h + b2
σ is an element-wise activation function. Why do we need a nonlinear activation?
Single Hidden Layer
Without a nonlinear activation: h = W1 x + b1 and o = w2ᵀ h + b2, hence o = w2ᵀ W1 x + b′, which is still linear…
ReLU Activation
ReLU: rectified linear unit
ReLU(x) = max(x,0)
Sigmoid Activation
Maps inputs into (0, 1); a soft version of the step function σ(x) = 1 if x > 0, 0 otherwise:
sigmoid(x) = 1 / (1 + exp(−x))
Tanh Activation
Maps inputs into (−1, 1):
tanh(x) = (1 − exp(−2x)) / (1 + exp(−2x))
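The three activations side by side in NumPy:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)                             # into [0, inf)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))                         # into (0, 1)

def tanh(x):
    return (1 - np.exp(-2 * x)) / (1 + np.exp(-2 * x))  # into (-1, 1)

x = np.array([-2.0, 0.0, 2.0])
print(relu(x), sigmoid(x), tanh(x))
```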
Multiclass Classification
y1, y2, …, yk = softmax(o1, o2, …, ok)
Multiple Hidden Layers
Hyper-parameters: the number of hidden layers and the width of each.
h1 = σ(W1 x + b1)
h2 = σ(W2 h1 + b2)
h3 = σ(W3 h2 + b3)
MLP Notebook
Summary