SLIDE 1

Full Stack Deep Learning

Troubleshooting Deep Neural Networks Josh Tobin, Sergey Karayev, Pieter Abbeel

SLIDE 2

Full Stack Deep Learning (March 2019) Pieter Abbeel, Sergey Karayev, Josh Tobin L6: Troubleshooting

Lifecycle of an ML project

Per-project activities: Planning & project setup → Data collection & labeling → Training & debugging → Deploying & testing.
Cross-project infrastructure: Team & hiring; Infra & tooling.

SLIDE 3

Why talk about DL troubleshooting?

XKCD, https://xkcd.com/1838/

SLIDE 4

Why talk about DL troubleshooting?

SLIDE 5

Why talk about DL troubleshooting?

Common sentiment among practitioners: 80-90% of time is spent debugging and tuning, 10-20% deriving math or implementing things.

SLIDE 6

Why is DL troubleshooting so hard?

SLIDE 7

Suppose you can’t reproduce a result

He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

Your learning curve

SLIDE 8

Why is your performance worse?

Poor model performance

SLIDE 9

Why is your performance worse?

Sources of poor model performance: implementation bugs

SLIDE 10

Most DL bugs are invisible

SLIDE 11

Most DL bugs are invisible

Labels out of order!

SLIDE 12

Why is your performance worse?

Sources of poor model performance: implementation bugs

SLIDE 13

Why is your performance worse?

Sources of poor model performance: implementation bugs, hyperparameter choices

SLIDE 14

Models are sensitive to hyperparameters


Andrej Karpathy, CS231n course notes

SLIDE 15

Andrej Karpathy, CS231n course notes He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." Proceedings of the IEEE international conference on computer vision. 2015.

Models are sensitive to hyperparameters

SLIDE 16

Why is your performance worse?

Sources of poor model performance: implementation bugs, hyperparameter choices

SLIDE 17

Why is your performance worse?

Sources of poor model performance: implementation bugs, hyperparameter choices, data/model fit

SLIDE 18

Data / model fit

Data from the paper: ImageNet. Yours: self-driving car images.

SLIDE 19

Why is your performance worse?

Sources of poor model performance: implementation bugs, hyperparameter choices, data/model fit

SLIDE 20

Why is your performance worse?

Sources of poor model performance: implementation bugs, hyperparameter choices, data/model fit, dataset construction

SLIDE 21

Constructing good datasets is hard

[Chart: "Amount of lost sleep over..." data vs. models, during PhD vs. at Tesla.] Slide from Andrej Karpathy’s talk “Building the Software 2.0 Stack” at TrainAI 2018, 5/10/2018.

SLIDE 22

Common dataset construction issues

  • Not enough data
  • Class imbalances
  • Noisy labels
  • Train / test from different distributions
  • etc.
SLIDE 23

Takeaways: why is troubleshooting hard?

  • Hard to tell if you have a bug
  • Lots of possible sources for the same degradation in performance
  • Results can be sensitive to small changes in hyperparameters and dataset makeup

SLIDE 24

Strategy for DL troubleshooting

SLIDE 25

Key mindset for DL troubleshooting

Pessimism.

SLIDE 26

Key idea of DL troubleshooting

Since it’s hard to disambiguate errors… start simple and gradually ramp up complexity.

SLIDE 27

Strategy for DL troubleshooting

Start simple → Implement & debug → Evaluate → Tune hyperparameters / Improve model/data → (loop back to Evaluate until the model meets requirements)

SLIDE 28

Quick summary

  • Start simple: choose the simplest model & data possible (e.g., LeNet on a subset of your data)

SLIDE 29

Quick summary

  • Start simple: choose the simplest model & data possible (e.g., LeNet on a subset of your data)
  • Implement & debug: once model runs, overfit a single batch & reproduce a known result

SLIDE 30

Quick summary

  • Start simple: choose the simplest model & data possible (e.g., LeNet on a subset of your data)
  • Implement & debug: once model runs, overfit a single batch & reproduce a known result
  • Evaluate: apply the bias-variance decomposition to decide what to do next
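The bias-variance decomposition here amounts to simple arithmetic on your train/val/test errors. A minimal sketch, with error values invented purely for illustration:

```python
# Decompose observed errors to decide what to work on next.
# All numbers below are illustrative, not from the lecture.
human_error = 0.01   # proxy for irreducible error
train_error = 0.09
val_error   = 0.15

avoidable_bias = train_error - human_error   # large -> underfitting
variance       = val_error - train_error     # large -> overfitting

print(f"avoidable bias: {avoidable_bias:.2f}")  # -> make the model bigger
print(f"variance:       {variance:.2f}")        # -> add data or regularize
```

Whichever term dominates tells you whether to grow the model (bias) or add data/regularization (variance).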

SLIDE 31

Quick summary

  • Start simple: choose the simplest model & data possible (e.g., LeNet on a subset of your data)
  • Implement & debug: once model runs, overfit a single batch & reproduce a known result
  • Evaluate: apply the bias-variance decomposition to decide what to do next
  • Tune hyperparameters: use coarse-to-fine random searches
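A coarse-to-fine random search can be sketched as below. `toy_score` is a hypothetical stand-in for "train briefly and report validation performance", and the search ranges are illustrative:

```python
import math
import random

def sample_config(lr_range, reg_range, rng):
    # Sample on a log scale: these hyperparameters vary over orders
    # of magnitude.
    return {"lr": 10 ** rng.uniform(*lr_range),
            "reg": 10 ** rng.uniform(*reg_range)}

def random_search(score_fn, lr_range, reg_range, n_trials, seed=0):
    rng = random.Random(seed)
    trials = [sample_config(lr_range, reg_range, rng) for _ in range(n_trials)]
    return max(trials, key=score_fn)

# Toy objective with a known optimum near lr=1e-3, reg=1e-4; stands in
# for "train briefly, return validation accuracy".
def toy_score(cfg):
    return -(math.log10(cfg["lr"]) + 3) ** 2 - (math.log10(cfg["reg"]) + 4) ** 2

coarse = random_search(toy_score, (-6, -1), (-8, -1), n_trials=50)
# Fine stage: zoom into a narrow window around the coarse winner.
center = math.log10(coarse["lr"])
fine = random_search(toy_score, (center - 0.5, center + 0.5), (-8, -1),
                     n_trials=50, seed=1)
```

Random search samples the full range of each hyperparameter independently, which is why it tends to beat grid search when only a few hyperparameters matter.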
SLIDE 32

Quick summary

  • Start simple: choose the simplest model & data possible (e.g., LeNet on a subset of your data)
  • Implement & debug: once model runs, overfit a single batch & reproduce a known result
  • Evaluate: apply the bias-variance decomposition to decide what to do next
  • Tune hyperparameters: use coarse-to-fine random searches
  • Improve model/data: make your model bigger if you underfit; add data or regularize if you overfit

SLIDE 33

We’ll assume you already have…

  • An initial test set
  • A single metric to improve
  • Target performance based on human-level performance, published results, previous baselines, etc.

SLIDE 34

We’ll assume you already have…

Running example: pedestrian detection, labels 0 (no pedestrian) / 1 (yes pedestrian). Goal: 99% classification accuracy.

  • An initial test set
  • A single metric to improve
  • Target performance based on human-level performance, published results, previous baselines, etc.

SLIDE 35

Strategy for DL troubleshooting

Start simple → Implement & debug → Evaluate → Tune hyperparameters / Improve model/data → (loop back to Evaluate until the model meets requirements)

SLIDE 36

Starting simple

Steps: choose a simple architecture → use sensible defaults → normalize inputs → simplify the problem.
SLIDE 37

Demystifying architecture selection

  • Images — start here: LeNet-like architecture; consider using this later: ResNet
  • Sequences — start here: LSTM with one hidden layer (or temporal convs); consider using this later: attention model or WaveNet-like model
  • Other — start here: fully connected neural net with one hidden layer; later: problem-dependent
SLIDE 38

Dealing with multiple input modalities

[Diagram: three inputs of different modalities, including the text “This is a cat”.]
SLIDE 39

Dealing with multiple input modalities

1. Map each input into a lower-dimensional feature space.
SLIDE 40

Dealing with multiple input modalities

1. Map each input into a lower-dimensional feature space: a ConvNet branch, a Flatten branch, and an LSTM branch (for the text “This is a cat”), producing 64-, 72-, and 48-dim features.
SLIDE 41

Dealing with multiple input modalities

2. Con“cat”: concatenate the feature vectors (64 + 72 + 48 = 184-dim).
SLIDE 42

Dealing with multiple input modalities

3. Pass the concatenated features (184-dim) through fully connected layers (256-dim, then 128-dim) to the output (T/F).
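The three steps above can be sketched end to end. The random projections below are placeholders for trained ConvNet/LSTM/flatten encoders; only the dimensions (64 + 72 + 48 = 184, then 256 and 128) come from the figure, the input sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, out_dim):
    # Placeholder encoder: a fixed random projection standing in for a
    # ConvNet / LSTM / flatten branch that embeds one modality.
    w = rng.standard_normal((x.shape[-1], out_dim))
    return np.tanh(x @ w)

batch = 4
a = encode(rng.standard_normal((batch, 300)), 64)  # e.g. image branch
b = encode(rng.standard_normal((batch, 50)), 72)   # e.g. flattened features
c = encode(rng.standard_normal((batch, 30)), 48)   # e.g. text branch

fused = np.concatenate([a, b, c], axis=-1)                    # (batch, 184)
h = np.maximum(fused @ rng.standard_normal((184, 256)), 0.0)  # FC + ReLU
h = np.maximum(h @ rng.standard_normal((256, 128)), 0.0)      # FC + ReLU
logit = h @ rng.standard_normal((128, 1))                     # T/F output
```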
SLIDE 43

Starting simple

Steps: choose a simple architecture → use sensible defaults → normalize inputs → simplify the problem.
SLIDE 44

Recommended network / optimizer defaults

  • Optimizer: Adam optimizer with learning rate 3e-4
  • Activations: relu (FC and Conv models), tanh (LSTMs)
  • Initialization: He et al. normal (relu), Glorot normal (tanh)
  • Regularization: None
  • Data normalization: None
SLIDE 45

Definitions of recommended initializers

  • He et al. normal (used for ReLU): N(0, sqrt(2/n))
  • Glorot normal (used for tanh): N(0, sqrt(2/(n+m)))
  • (n is the number of inputs, m is the number of outputs)
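Both initializers are one-liners; a numpy sketch matching the definitions above:

```python
import numpy as np

def he_normal(n, m, rng):
    # He et al.: std = sqrt(2 / n), n = number of inputs. Use with ReLU.
    return rng.normal(0.0, np.sqrt(2.0 / n), size=(n, m))

def glorot_normal(n, m, rng):
    # Glorot: std = sqrt(2 / (n + m)), m = number of outputs. Use with tanh.
    return rng.normal(0.0, np.sqrt(2.0 / (n + m)), size=(n, m))

rng = np.random.default_rng(0)
w = he_normal(512, 256, rng)
print(w.std())  # empirical std close to sqrt(2/512) = 0.0625
```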
SLIDE 46

Starting simple

Steps: choose a simple architecture → use sensible defaults → normalize inputs → simplify the problem.
SLIDE 47

Important to normalize scale of input data

  • Subtract the mean and divide by the standard deviation
  • For images, fine to scale values to [0, 1] or [-0.5, 0.5] (e.g., by dividing by 255)
  • Careful: make sure your library doesn’t do it for you!
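A minimal numpy sketch of both recipes; the array shapes are invented, and statistics are computed on the training split only:

```python
import numpy as np

# Images: uint8 pixels -> float32 in [0, 1]. The explicit cast also
# avoids the common "forgot to cast from uint8" bug.
images = np.random.default_rng(0).integers(0, 256, size=(8, 32, 32, 3),
                                           dtype=np.uint8)
images = images.astype(np.float32) / 255.0

# General features: subtract the mean and divide by the standard
# deviation, with statistics from the training set only.
train_feats = np.random.default_rng(1).normal(5.0, 3.0, size=(100, 4))
mu, sigma = train_feats.mean(axis=0), train_feats.std(axis=0)
train_feats = (train_feats - mu) / sigma
```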

SLIDE 48

Starting simple

Steps: choose a simple architecture → use sensible defaults → normalize inputs → simplify the problem.
SLIDE 49

Consider simplifying the problem as well

  • Start with a small training set (~10,000 examples)
  • Use a fixed number of objects, classes, image size, etc.
  • Create a simpler synthetic training set
SLIDE 50

Simplest model for pedestrian detection

Running example: 0 (no pedestrian), 1 (yes pedestrian). Goal: 99% classification accuracy.

  • Start with a subset of 10,000 images for training, 1,000 for val, and 500 for test
  • Use a LeNet architecture with sigmoid cross-entropy loss
  • Adam optimizer with LR 3e-4
  • No regularization
SLIDE 51

Starting simple

Steps: choose a simple architecture → use sensible defaults → normalize inputs → simplify the problem.

Summary

  • LeNet, LSTM, or fully connected
  • Adam optimizer & no regularization
  • Subtract mean and divide by std, or just divide by 255 (images)
  • Start with a simpler version of your problem (e.g., smaller dataset)
SLIDE 52

Strategy for DL troubleshooting

Start simple → Implement & debug → Evaluate → Tune hyperparameters / Improve model/data → (loop back to Evaluate until the model meets requirements)

SLIDE 53

Implementing bug-free DL models

Steps: get your model to run → overfit a single batch → compare to a known result.
SLIDE 54

Preview: the five most common DL bugs

  • Incorrect shapes for your tensors. Can fail silently! E.g., accidental broadcasting: x.shape = (None,), y.shape = (None, 1), (x+y).shape = (None, None)
  • Pre-processing inputs incorrectly. E.g., forgetting to normalize, or too much pre-processing
  • Incorrect input to your loss function. E.g., softmaxed outputs to a loss that expects logits
  • Forgot to set up train mode for the net correctly. E.g., toggling train/eval, controlling batch norm dependencies
  • Numerical instability (inf/NaN). Often stems from using an exp, log, or div operation
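The first bug in the list is easy to reproduce in numpy (with concrete sizes in place of None):

```python
import numpy as np

x = np.zeros(3)        # shape (3,)
y = np.zeros((3, 1))   # shape (3, 1)

# No error, no warning: the two arrays silently broadcast against
# each other into a (3, 3) matrix instead of adding elementwise.
print((x + y).shape)   # (3, 3)

# Fix: make the shapes agree explicitly before adding.
z = x + y.squeeze(axis=1)
print(z.shape)         # (3,)
```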

SLIDE 55

General advice for implementing your model

Lightweight implementation

  • Minimum possible new lines of code for v1
  • Rule of thumb: <200 lines
  • (Tested infrastructure components are fine)

SLIDE 56

General advice for implementing your model

Use off-the-shelf components, e.g.,

  • Keras
  • tf.layers.dense(…) instead of tf.nn.relu(tf.matmul(W, x))
  • tf.losses.cross_entropy(…) instead of writing out the exp

Lightweight implementation

  • Minimum possible new lines of code for v1
  • Rule of thumb: <200 lines
  • (Tested infrastructure components are fine)

SLIDE 57

General advice for implementing your model

Build complicated data pipelines later

  • Start with a dataset you can load into memory

Use off-the-shelf components, e.g.,

  • Keras
  • tf.layers.dense(…) instead of tf.nn.relu(tf.matmul(W, x))
  • tf.losses.cross_entropy(…) instead of writing out the exp

Lightweight implementation

  • Minimum possible new lines of code for v1
  • Rule of thumb: <200 lines
  • (Tested infrastructure components are fine)

SLIDE 58

Implementing bug-free DL models

Steps: get your model to run → overfit a single batch → compare to a known result.
SLIDE 59

Implementing bug-free DL models: get your model to run

Common issues → recommended resolution:
  • Shape mismatch, casting issue → step through model creation and inference in a debugger
  • OOM → scale back memory-intensive operations one-by-one
  • Other → standard debugging toolkit (Stack Overflow + interactive debugger)

SLIDE 60

Implementing bug-free DL models: get your model to run

Common issues → recommended resolution:
  • Shape mismatch, casting issue → step through model creation and inference in a debugger
  • OOM → scale back memory-intensive operations one-by-one
  • Other → standard debugging toolkit (Stack Overflow + interactive debugger)

SLIDE 61

Debuggers for DL code

  • PyTorch: easy, use ipdb
  • TensorFlow: trickier. Option 1: step through graph creation
SLIDE 62

Debuggers for DL code

  • PyTorch: easy, use ipdb
  • TensorFlow: trickier. Option 2: step into the training loop and evaluate tensors using sess.run(…)
SLIDE 63

Debuggers for DL code

  • PyTorch: easy, use ipdb
  • TensorFlow: trickier. Option 3: use tfdbg, which stops execution at each sess.run(…) and lets you inspect tensors:

python -m tensorflow.python.debug.examples.debug_mnist --debug
SLIDE 64

Implementing bug-free DL models: get your model to run

Common issues → recommended resolution:
  • Shape mismatch, casting issue → step through model creation and inference in a debugger
  • OOM → scale back memory-intensive operations one-by-one
  • Other → standard debugging toolkit (Stack Overflow + interactive debugger)

SLIDE 65

Shape mismatch: most common causes

Incorrect shapes:
  • Flipped dimensions when using tf.reshape(…)
  • Took sum, average, or softmax over the wrong dimension
  • Forgot to flatten after conv layers
  • Forgot to get rid of extra “1” dimensions (e.g., if shape is (None, 1, 1, 4))
  • Data stored on disk in a different dtype than loaded (e.g., stored a float64 numpy array, and loaded it as a float32)

Undefined shapes:
  • Confusing tensor.shape, tf.shape(tensor), and tensor.get_shape()
  • Reshaping things to a shape of type Tensor (e.g., when loading data from a file)
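Two of these causes are easy to demonstrate in numpy; both run without error but produce the wrong shapes:

```python
import numpy as np

# Reduction over the wrong dimension: axis=0 sums over the batch,
# not over the classes.
scores = np.ones((32, 10))          # (batch, classes)
wrong = scores.sum(axis=0)          # shape (10,): summed across examples
right = scores.sum(axis=1)          # shape (32,): one value per example

# Forgot to flatten after conv layers: the FC layer needs 2-D input.
conv_out = np.ones((32, 4, 4, 16))              # (batch, h, w, channels)
flat = conv_out.reshape(conv_out.shape[0], -1)  # (32, 256)
```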

SLIDE 66

Casting issue (data not in float32): most common causes

  • Forgot to cast images from uint8 to float32
  • Generated data using numpy in float64, forgot to cast to float32

SLIDE 67

OOM: most common causes

Other processes:
  • Other processes running on your GPU

Too big a tensor:
  • Too large a batch size for your model (e.g., during evaluation)
  • Too large fully connected layers

Too much data:
  • Loading too large a dataset into memory, rather than using an input queue
  • Allocating too large a buffer for dataset creation

Duplicating operations:
  • Memory leak due to creating multiple models in the same session
  • Repeatedly creating an operation (e.g., in a function that gets called over and over again)

SLIDE 68

Other common errors: most common causes

  • Forgot to initialize variables
  • Forgot to turn off bias when using batch norm
  • “Fetch argument has invalid type”: usually you overwrote one of your ops with an output during training

SLIDE 69

Implementing bug-free DL models

Steps: get your model to run → overfit a single batch → compare to a known result.

SLIDE 70

Implementing bug-free DL models: overfit a single batch

Common issues: error goes up, error explodes, error oscillates, error plateaus.

SLIDE 71

Overfit a single batch. Common issue: error goes up. Most common causes:

  • Flipped the sign of the loss function / gradient
  • Learning rate too high
  • Softmax taken over wrong dimension
SLIDE 72

Overfit a single batch. Common issue: error explodes. Most common causes:

  • Numerical issue. Check all exp, log, and div operations
  • Learning rate too high


SLIDE 73

Overfit a single batch. Common issue: error oscillates. Most common causes:

  • Data or labels corrupted (e.g., zeroed, incorrectly shuffled, or preprocessed incorrectly)
  • Learning rate too high


SLIDE 74

Overfit a single batch. Common issue: error plateaus. Most common causes:

  • Learning rate too low
  • Gradients not flowing through the whole model
  • Too much regularization
  • Incorrect input to loss function (e.g., softmax instead of logits, accidentally added ReLU on output)
  • Data or labels corrupted


SLIDE 75

Overfit a single batch: all common issues and their most common causes

Error goes up:
  • Flipped the sign of the loss function / gradient
  • Learning rate too high
  • Softmax taken over wrong dimension

Error explodes:
  • Numerical issue: check all exp, log, and div operations
  • Learning rate too high

Error oscillates:
  • Data or labels corrupted (e.g., zeroed or incorrectly shuffled)
  • Learning rate too high

Error plateaus:
  • Learning rate too low
  • Gradients not flowing through the whole model
  • Too much regularization
  • Incorrect input to loss function (e.g., softmax instead of logits)
  • Data or labels corrupted
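The overfit-a-single-batch check itself can be sketched with a toy model: a correct implementation should drive training loss on one small batch to nearly zero. The logistic-regression loop below is a stand-in for the real model, with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8))                    # one small batch
y = (X @ rng.standard_normal(8) > 0).astype(float)  # separable labels

w, b, lr = np.zeros(8), 0.0, 0.5
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid
    g = (p - y) / len(y)                     # d(loss)/d(logits)
    w -= lr * (X.T @ g)                      # full-batch gradient step
    b -= lr * g.sum()

loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
# With a correct implementation, loss is near zero and every example
# in the batch is classified correctly; if not, consult the table above.
```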
SLIDE 76

Implementing bug-free DL models

Steps: get your model to run → overfit a single batch → compare to a known result.

SLIDE 77

Hierarchy of known results (more useful → less useful)

  • Official model implementation evaluated on a similar dataset to yours

You can: walk through the code line-by-line and ensure you have the same output; ensure your performance is up to par with expectations.

SLIDE 78

Hierarchy of known results (more useful → less useful)

  • Official model implementation evaluated on a similar dataset to yours
  • Official model implementation evaluated on a benchmark (e.g., MNIST)

You can: walk through the code line-by-line and ensure you have the same output.
slide-79
SLIDE 79


Hierarchy of known results

79

(More useful → less useful)

  • Official model implementation evaluated on similar dataset to yours
  • Official model implementation evaluated on benchmark (e.g., MNIST)
  • Unofficial model implementation
  • Results from the paper (with no code)
  • Results from your model on a benchmark dataset (e.g., MNIST)
  • Results from a similar model on a similar dataset
  • Super simple baselines (e.g., average of outputs or linear regression)

You can (with an unofficial implementation):
  • Same as before, but with lower confidence

  • 2. Implement & debug
slide-80
SLIDE 80


Hierarchy of known results

80

(More useful → less useful)

  • Official model implementation evaluated on similar dataset to yours
  • Official model implementation evaluated on benchmark (e.g., MNIST)
  • Unofficial model implementation
  • Results from the paper (with no code)

You can (with results from the paper):
  • Ensure your performance is up to par with expectations
  • 2. Implement & debug
slide-81
SLIDE 81


Hierarchy of known results

81

(More useful → less useful)

  • Official model implementation evaluated on similar dataset to yours
  • Official model implementation evaluated on benchmark (e.g., MNIST)
  • Unofficial model implementation
  • Results from the paper (with no code)
  • Results from your model on a benchmark dataset (e.g., MNIST)

You can (on a benchmark dataset):
  • Make sure your model performs well in a simpler setting

  • 2. Implement & debug
slide-82
SLIDE 82


Hierarchy of known results

82

(More useful → less useful)

  • Official model implementation evaluated on similar dataset to yours
  • Official model implementation evaluated on benchmark (e.g., MNIST)
  • Unofficial model implementation
  • Results from the paper (with no code)
  • Results from your model on a benchmark dataset (e.g., MNIST)
  • Results from a similar model on a similar dataset
  • Super simple baselines (e.g., average of outputs or linear regression)

You can (with a similar model on a similar dataset):
  • Get a general sense of what kind of performance can be expected

  • 2. Implement & debug
slide-83
SLIDE 83


Hierarchy of known results

83

(More useful → less useful)

  • Official model implementation evaluated on similar dataset to yours
  • Official model implementation evaluated on benchmark (e.g., MNIST)
  • Unofficial model implementation
  • Results from the paper (with no code)
  • Results from your model on a benchmark dataset (e.g., MNIST)
  • Results from a similar model on a similar dataset
  • Super simple baselines (e.g., average of outputs or linear regression)

You can (with super simple baselines):
  • Make sure your model is learning anything at all

  • 2. Implement & debug
slide-84
SLIDE 84


Hierarchy of known results

84

(More useful → less useful)

  • Official model implementation evaluated on similar dataset to yours
  • Official model implementation evaluated on benchmark (e.g., MNIST)
  • Unofficial model implementation
  • Results from the paper (with no code)
  • Results from your model on a benchmark dataset (e.g., MNIST)
  • Results from a similar model on a similar dataset
  • Super simple baselines (e.g., average of outputs or linear regression)

  • 2. Implement & debug
slide-85
SLIDE 85


Summary: how to implement & debug

85

Steps

  • Get your model to run: step through in debugger & watch out for shape, casting, and OOM errors
  • Overfit a single batch: look for corrupted data, over-regularization, broadcasting errors
  • Compare to a known result: keep iterating until model performs up to expectations

  • 2. Implement & debug
slide-86
SLIDE 86


Strategy for DL troubleshooting

86

Start simple → Implement & debug → Evaluate → Improve model/data → Tune hyper-parameters → (repeat until it) meets requirements

slide-87
SLIDE 87


Bias-variance decomposition

87

  • 3. Evaluate
slide-88
SLIDE 88


Bias-variance decomposition

88

  • 3. Evaluate
slide-89
SLIDE 89


Bias-variance decomposition

89

  • 3. Evaluate
slide-90
SLIDE 90


[Bar chart: breakdown of test error by source. Irreducible error, plus avoidable bias (i.e., under-fitting) up to the train error, plus variance (i.e., over-fitting) up to the val error, plus val set overfitting up to the test error.]

Bias-variance decomposition

90

  • 3. Evaluate
slide-91
SLIDE 91


Bias-variance decomposition

91

Test error = irreducible error + bias + variance + val overfitting

This assumes train, val, and test all come from the same distribution. What if not?
  • 3. Evaluate
slide-92
SLIDE 92


Handling distribution shift

92

Use two val sets: one sampled from the training distribution and one from the test distribution.

  • 3. Evaluate
slide-93
SLIDE 93


The bias-variance tradeoff

93

  • 3. Evaluate
slide-94
SLIDE 94


Bias-variance with distribution shift

94

  • 3. Evaluate
slide-95
SLIDE 95


Bias-variance with distribution shift

95

[Bar chart: breakdown of test error by source under distribution shift. Irreducible error, plus avoidable bias (i.e., under-fitting) up to the train error, plus variance up to the train-val error, plus distribution shift up to the test-val error, plus val overfitting up to the test error.]

  • 3. Evaluate
slide-96
SLIDE 96


Train, val, and test error for pedestrian detection

96

Error source       Value
Goal performance   1%
Train error        20%
Validation error   27%
Test error         28%

Train - goal = 19% (under-fitting)

0 (no pedestrian) 1 (yes pedestrian)

Goal: 99% classification accuracy

Running example

  • 3. Evaluate
slide-97
SLIDE 97


Train, val, and test error for pedestrian detection

97

Error source       Value
Goal performance   1%
Train error        20%
Validation error   27%
Test error         28%

Val - train = 7% (over-fitting)

0 (no pedestrian) 1 (yes pedestrian)

Goal: 99% classification accuracy

Running example

  • 3. Evaluate
slide-98
SLIDE 98


Train, val, and test error for pedestrian detection

98

Error source       Value
Goal performance   1%
Train error        20%
Validation error   27%
Test error         28%

Test - val = 1% (looks good!)

0 (no pedestrian) 1 (yes pedestrian)

Goal: 99% classification accuracy

Running example

  • 3. Evaluate
slide-99
SLIDE 99


Summary: evaluating model performance

99

Test error = irreducible error + bias + variance + distribution shift + val overfitting

  • 3. Evaluate
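The decomposition is plain arithmetic on measured errors. A sketch using the running example's numbers (goal 1%, train 20%, val 27%, test 28%) and following the lecture's convention of using goal performance as a proxy for irreducible error:

```python
def decompose(goal, train, val, test):
    """Attribute test error to sources, using goal performance as a proxy
    for irreducible error (the lecture's convention)."""
    return {
        "irreducible error": goal,
        "avoidable bias (under-fitting)": train - goal,
        "variance (over-fitting)": val - train,
        "val overfitting": test - val,
    }

# Numbers from the pedestrian-detection running example:
parts = decompose(goal=0.01, train=0.20, val=0.27, test=0.28)
for name, value in parts.items():
    print(f"{name}: {value:.0%}")
```

With the two-val-set setup, the same idea extends by splitting variance at the train-val error and adding a distribution-shift term equal to the test-val error minus the train-val error.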
slide-100
SLIDE 100


Strategy for DL troubleshooting

100

Start simple → Implement & debug → Evaluate → Improve model/data → Tune hyper-parameters → (repeat until it) meets requirements

slide-101
SLIDE 101


Prioritizing improvements (i.e., applied bias-variance)

101

Steps: (a) Address under-fitting, (b) Address over-fitting, (c) Address distribution shift, (d) Re-balance datasets (if applicable)

  • 4. Improve
slide-102
SLIDE 102


Addressing under-fitting (i.e., reducing bias)

102

(Try first → try later)

A. Make your model bigger (i.e., add layers or use more units per layer)
B. Reduce regularization
C. Error analysis
D. Choose a different (closer to state-of-the-art) model architecture (e.g., move from LeNet to ResNet)
E. Tune hyper-parameters (e.g., learning rate)
F. Add features

  • 4. Improve
slide-103
SLIDE 103


Train, val, and test error for pedestrian detection

103

0 (no pedestrian) 1 (yes pedestrian)

Goal: 99% classification accuracy 
 (i.e., 1% error)

Error source       Value   Value
Goal performance   1%      1%
Train error        20%     7%
Validation error   27%     19%
Test error         28%     20%

Add more layers to the ConvNet

  • 4. Improve
slide-104
SLIDE 104


Train, val, and test error for pedestrian detection

104

0 (no pedestrian) 1 (yes pedestrian)

Goal: 99% classification accuracy 
 (i.e., 1% error)

Error source       Value   Value   Value
Goal performance   1%      1%      1%
Train error        20%     7%      3%
Validation error   27%     19%     10%
Test error         28%     20%     10%

Switch to ResNet-101

  • 4. Improve
slide-105
SLIDE 105


Train, val, and test error for pedestrian detection

105

0 (no pedestrian) 1 (yes pedestrian)

Goal: 99% classification accuracy 
 (i.e., 1% error)

Error source       Value   Value   Value   Value
Goal performance   1%      1%      1%      1%
Train error        20%     7%      3%      0.8%
Validation error   27%     19%     10%     12%
Test error         28%     20%     10%     12%

Tune learning rate

  • 4. Improve
slide-106
SLIDE 106


Prioritizing improvements (i.e., applied bias-variance)

106

Steps: (a) Address under-fitting, (b) Address over-fitting, (c) Address distribution shift, (d) Re-balance datasets (if applicable)

  • 4. Improve
slide-107
SLIDE 107


Addressing over-fitting (i.e., reducing variance)

107

(Try first → try later)

A. Add more training data (if possible!)
B. Add normalization (e.g., batch norm, layer norm)
C. Add data augmentation
D. Increase regularization (e.g., dropout, L2, weight decay)
E. Error analysis
F. Choose a different (closer to state-of-the-art) model architecture
G. Tune hyperparameters
H. Early stopping
I. Remove features
J. Reduce model size

  • 4. Improve
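Item D (increase regularization) is concrete even in the smallest setting. A pure-Python sketch of L2 weight decay on a hypothetical one-parameter model: the penalty pulls the weight toward zero, trading a little training error for less overfitting. The data and penalty strength are made up for illustration.

```python
# Gradient descent on a 1-parameter model pred = w*x with an L2 penalty:
#   loss = mean((w*x - y)^2) + lam * w^2

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]  # roughly y = 2x

def fit(lam, lr=0.01, steps=5000):
    w = 0.0
    n = len(data)
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / n + 2 * lam * w
        w -= lr * grad
    return w

print(fit(lam=0.0))    # unregularized fit, close to the data's slope of ~2
print(fit(lam=10.0))   # heavy weight decay shrinks the weight toward 0
```

In a framework this is usually a one-line change (a weight-decay argument to the optimizer or an L2 term in the loss), but the effect is exactly the shrinkage shown here.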
slide-108
SLIDE 108


Addressing over-fitting (i.e., reducing variance)

108

(Try first → try later; H, I, and J are not recommended!)

A. Add more training data (if possible!)
B. Add normalization (e.g., batch norm, layer norm)
C. Add data augmentation
D. Increase regularization (e.g., dropout, L2, weight decay)
E. Error analysis
F. Choose a different (closer to state-of-the-art) model architecture
G. Tune hyperparameters
H. Early stopping
I. Remove features
J. Reduce model size

  • 4. Improve
slide-109
SLIDE 109


Train, val, and test error for pedestrian detection

109

Error source       Value
Goal performance   1%
Train error        0.8%
Validation error   12%
Test error         12%

0 (no pedestrian) 1 (yes pedestrian)

Goal: 99% classification accuracy

Running example

  • 4. Improve
slide-110
SLIDE 110


Train, val, and test error for pedestrian detection

110

Error source       Value   Value
Goal performance   1%      1%
Train error        0.8%    1.5%
Validation error   12%     5%
Test error         12%     6%

Increase dataset size to 250,000

0 (no pedestrian) 1 (yes pedestrian)

Goal: 99% classification accuracy

Running example

  • 4. Improve
slide-111
SLIDE 111


Train, val, and test error for pedestrian detection

111

Error source       Value   Value   Value
Goal performance   1%      1%      1%
Train error        0.8%    1.5%    1.7%
Validation error   12%     5%      4%
Test error         12%     6%      4%

Add weight decay

0 (no pedestrian) 1 (yes pedestrian)

Goal: 99% classification accuracy

Running example

  • 4. Improve
slide-112
SLIDE 112


Train, val, and test error for pedestrian detection

112

Error source       Value   Value   Value   Value
Goal performance   1%      1%      1%      1%
Train error        0.8%    1.5%    1.7%    2%
Validation error   12%     5%      4%      2.5%
Test error         12%     6%      4%      2.6%

Add data augmentation

0 (no pedestrian) 1 (yes pedestrian)

Goal: 99% classification accuracy

Running example

  • 4. Improve
slide-113
SLIDE 113


Train, val, and test error for pedestrian detection

113

Error source       Value   Value   Value   Value   Value
Goal performance   1%      1%      1%      1%      1%
Train error        0.8%    1.5%    1.7%    2%      0.6%
Validation error   12%     5%      4%      2.5%    0.9%
Test error         12%     6%      4%      2.6%    1.0%

Tune num layers, optimizer params, weight initialization, kernel size, weight decay

0 (no pedestrian) 1 (yes pedestrian)

Goal: 99% classification accuracy

Running example

  • 4. Improve
slide-114
SLIDE 114


Prioritizing improvements (i.e., applied bias-variance)

114

Steps: (a) Address under-fitting, (b) Address over-fitting, (c) Address distribution shift, (d) Re-balance datasets (if applicable)

  • 4. Improve
slide-115
SLIDE 115


Addressing distribution shift

115

(Try first → try later)

A. Analyze test-val set errors & collect more training data to compensate
B. Analyze test-val set errors & synthesize more training data to compensate
C. Apply domain adaptation techniques to training & test distributions

  • 4. Improve
slide-116
SLIDE 116


Error analysis

116

Test-val set errors (no pedestrian detected) Train-val set errors (no pedestrian detected)

  • 4. Improve
slide-117
SLIDE 117


Error analysis

117

Test-val set errors (no pedestrian detected) Train-val set errors (no pedestrian detected)

Error type 1: hard-to-see pedestrians

  • 4. Improve
slide-118
SLIDE 118


Error analysis

118

Test-val set errors (no pedestrian detected) Train-val set errors (no pedestrian detected)

Error type 2: reflections

  • 4. Improve
slide-119
SLIDE 119


Error analysis

119

Test-val set errors (no pedestrian detected) Train-val set errors (no pedestrian detected)

Error type 3 (test-val only): night scenes

  • 4. Improve
slide-120
SLIDE 120


Error analysis

120

Error type 1: Hard-to-see pedestrians. Error % (train-val): 0.1%; error % (test-val): 0.1%. Potential solutions: better sensors. Priority: Low.

Error type 2: Reflections. Error % (train-val): 0.3%; error % (test-val): 0.3%. Potential solutions: collect more data with reflections; add synthetic reflections to train set; try to remove with pre-processing; better sensors. Priority: Medium.

Error type 3: Nighttime scenes. Error % (train-val): 0.1%; error % (test-val): 1%. Potential solutions: collect more data at night; synthetically darken training images; simulate night-time data; use domain adaptation. Priority: High.

  • 4. Improve
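The priority column can be approximated mechanically: error types whose test-val error far exceeds their train-val error are attributable to distribution shift and matter most. This ranking heuristic is one plausible reading of the table above, not a formula from the lecture; the numbers are the table's.

```python
# Rank error types by the distribution-shift-attributable part of their
# error (test-val minus train-val), breaking ties by test-val error.

errors = {
    "hard-to-see pedestrians": {"train_val": 0.001, "test_val": 0.001},
    "reflections":             {"train_val": 0.003, "test_val": 0.003},
    "nighttime scenes":        {"train_val": 0.001, "test_val": 0.010},
}

def shift_gap(e):
    return e["test_val"] - e["train_val"]

ranked = sorted(
    errors,
    key=lambda k: (shift_gap(errors[k]), errors[k]["test_val"]),
    reverse=True,
)
print(ranked)  # nighttime scenes first, matching the table's High priority
```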
slide-121
SLIDE 121


Domain adaptation

121

What is it? Techniques to train on a “source” distribution and generalize to another “target” distribution using only unlabeled data or limited labeled data.

When should you consider using it?

  • Access to labeled data from the test distribution is limited
  • Access to relatively similar data is plentiful

  • 4. Improve
slide-122
SLIDE 122


Types of domain adaptation

122

Type: Supervised
  • Use case: you have limited data from the target domain
  • Example techniques: fine-tuning a pre-trained model; adding target data to the train set

Type: Unsupervised
  • Use case: you have lots of unlabeled data from the target domain
  • Example techniques: Correlation Alignment (CORAL); domain confusion; CycleGAN
  • 4. Improve
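To give a flavor of Correlation Alignment (CORAL) from the table above, here is a deliberately simplified one-dimensional sketch: shift and rescale source features so their mean and standard deviation match the target's. Real CORAL aligns the full covariance matrices of deep features; the data values here are made up for illustration.

```python
import statistics

def align_1d(source, target):
    """Shift/scale source features so their mean and std match the target's."""
    mu_s, sd_s = statistics.mean(source), statistics.pstdev(source)
    mu_t, sd_t = statistics.mean(target), statistics.pstdev(target)
    return [(x - mu_s) / sd_s * sd_t + mu_t for x in source]

source = [0.0, 1.0, 2.0, 3.0]      # e.g., a daytime feature statistic
target = [10.0, 10.5, 11.0, 11.5]  # e.g., a nighttime-like statistic

aligned = align_1d(source, target)
print(statistics.mean(aligned), statistics.pstdev(aligned))
```

After alignment the source features have exactly the target's first two moments, so a model trained on them sees statistics closer to the target domain's.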
slide-123
SLIDE 123


Address distribution shift

Prioritizing improvements (i.e., applied b-v)

123

Address under-fitting Re-balance datasets 
 (if applicable) Address over-fitting b a c

Steps

d

  • 4. Improve
slide-124
SLIDE 124


Rebalancing datasets

124

  • If (test-)val looks significantly better than test, you overfit to the val set
  • This happens with small val sets or lots of hyperparameter tuning
  • When it does, recollect val data
  • 4. Improve
slide-125
SLIDE 125


Strategy for DL troubleshooting

125

Start simple → Implement & debug → Evaluate → Improve model/data → Tune hyper-parameters → (repeat until it) meets requirements

slide-126
SLIDE 126


Hyperparameter optimization

126

Model & optimizer choices?

Network: ResNet
  • How many layers?
  • Weight initialization?
  • Kernel size?
  • Etc.

Optimizer: Adam
  • Batch size?
  • Learning rate?
  • beta1, beta2, epsilon?

Regularization
  • ….

0 (no pedestrian) 1 (yes pedestrian)

Goal: 99% classification accuracy

Running example

  • 5. Tune
slide-127
SLIDE 127


Which hyper-parameters to tune?

127

Choosing hyper-parameters

  • Models are more sensitive to some hyperparameters than others
  • Sensitivity depends on choice of model
  • Rules of thumb (only) below
  • Sensitivity is relative to default values! (e.g., if you are using all-zeros weight initialization or vanilla SGD, changing to sensible defaults will make a big difference)

Hyperparameter                              Approximate sensitivity
Learning rate                               High
Learning rate schedule                      High
Optimizer choice                            Low
Other optimizer params (e.g., Adam beta1)   Low
Batch size                                  Low
Weight initialization                       Medium
Loss function                               High
Model depth                                 Medium
Layer size                                  High
Layer params (e.g., kernel size)            Medium
Weight of regularization                    Medium
Nonlinearity                                Low

  • 5. Tune
slide-128
SLIDE 128


Method 1: manual hyperparam optimization

128

How it works

  • Understand the algorithm (e.g., a higher learning rate means faster but less stable training)
  • Train & evaluate model
  • Guess a better hyperparameter value & re-evaluate
  • Can be combined with other methods (e.g., manually select parameter ranges to optimize over)

Advantages
  • For a skilled practitioner, may require the least computation to get a good result

Disadvantages
  • Requires detailed understanding of the algorithm
  • Time-consuming
  • 5. Tune
slide-129
SLIDE 129


Method 2: grid search

129

[Diagram: grid of points over hyperparameter 1 (e.g., batch size) and hyperparameter 2 (e.g., learning rate)]

How it works: train with every combination of values from a grid over the hyperparameters.

Advantages
  • Super simple to implement
  • Can produce good results

Disadvantages
  • Not very efficient: need to train on all cross-combos of hyper-parameters
  • May require prior knowledge about parameters to get good results

  • 5. Tune
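Grid search is a few lines. A sketch with a stand-in `val_loss` function in place of actual training (the "best" settings baked into it are arbitrary, chosen only so the example has a minimum):

```python
import itertools
import math

def val_loss(lr, batch_size):
    # Stand-in for "train the model, measure val loss"; pretend the best
    # settings are lr=1e-2 and batch_size=64 (illustrative only).
    return (math.log10(lr) + 2) ** 2 + (math.log2(batch_size) - 6) ** 2

learning_rates = [1e-4, 1e-3, 1e-2, 1e-1]
batch_sizes = [16, 32, 64, 128]

# Train on every cross-combination of the two grids.
results = {(lr, bs): val_loss(lr, bs)
           for lr, bs in itertools.product(learning_rates, batch_sizes)}
best = min(results, key=results.get)
print(best)  # -> (0.01, 64)
```

The cost is the product of the grid sizes (here 4 × 4 = 16 training runs), which is why the method scales poorly past a couple of hyperparameters.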
slide-130
SLIDE 130


Method 3: random search

130

[Diagram: randomly sampled points over hyperparameter 1 (e.g., batch size) and hyperparameter 2 (e.g., learning rate)]

How it works: sample hyperparameter values at random from chosen ranges.

Advantages
  • Easy to implement
  • Often produces better results than grid search

Disadvantages
  • Not very interpretable
  • May require prior knowledge about parameters to get good results

  • 5. Tune
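Random search replaces the grid with sampled values; sampling the learning rate log-uniformly is common practice. As before, `val_loss` is a stand-in for training with arbitrary baked-in optima:

```python
import math
import random

random.seed(0)

def val_loss(lr, batch_size):
    # Stand-in for "train the model, measure val loss" (illustrative only).
    return (math.log10(lr) + 2) ** 2 + (math.log2(batch_size) - 6) ** 2

def sample():
    lr = 10 ** random.uniform(-5, -1)          # log-uniform over [1e-5, 1e-1]
    batch_size = random.choice([16, 32, 64, 128])
    return lr, batch_size

trials = [sample() for _ in range(30)]
best = min(trials, key=lambda t: val_loss(*t))
print(best, val_loss(*best))
```

Unlike a grid, every trial explores a fresh value of each hyperparameter, which is why random search tends to win when only a few hyperparameters really matter.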

slide-131
SLIDE 131


Method 4: coarse-to-fine

131

[Diagram: initial random search over hyperparameter 1 (e.g., batch size) and hyperparameter 2 (e.g., learning rate)]

  • 5. Tune
slide-132
SLIDE 132


Method 4: coarse-to-fine

132

[Diagram: best performers of the initial random search highlighted]

  • 5. Tune
slide-133
SLIDE 133


Method 4: coarse-to-fine

133

[Diagram: search range narrowed to the region around the best performers]

  • 5. Tune
slide-134
SLIDE 134


Method 4: coarse-to-fine

134

[Diagram: a finer random search within the narrowed region]

  • 5. Tune
slide-135
SLIDE 135


Method 4: coarse-to-fine

135

[Diagram: repeated narrowing around best performers over hyperparameter 1 (e.g., batch size) and hyperparameter 2 (e.g., learning rate), etc.]

How it works: alternate between coarse random searches and finer searches zoomed in on the best performers.

Advantages
  • Can narrow in on very high performing hyperparameters
  • Most used method in practice

Disadvantages
  • Somewhat manual process

  • 5. Tune
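A sketch of the coarse-to-fine loop, again over a stand-in objective in place of actual training (the optimum at log10(lr) = -2 is arbitrary):

```python
import random

random.seed(1)

def val_loss(log_lr):
    # Stand-in objective over log10(learning rate); best at -2 (illustrative).
    return (log_lr + 2) ** 2

lo, hi = -6.0, 0.0
global_best = None
for stage in range(3):
    samples = [random.uniform(lo, hi) for _ in range(20)]   # coarse random search
    ranked = sorted(samples, key=val_loss)
    if global_best is None or val_loss(ranked[0]) < val_loss(global_best):
        global_best = ranked[0]
    lo, hi = min(ranked[:5]), max(ranked[:5])               # zoom in on best performers
    print(f"stage {stage}: next range [{lo:.3f}, {hi:.3f}]")

print("best log10(lr) found:", global_best)
```

In practice each stage's "samples" are full training runs, and the practitioner eyeballs the results before choosing the next range, which is the "somewhat manual" part noted above.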
slide-136
SLIDE 136


Method 5: Bayesian hyperparam opt

136

How it works (at a high level)

  • Start with a prior estimate of parameter distributions
  • Maintain a probabilistic model of the relationship between hyperparameter values and model performance
  • Alternate between:
    • Training with the hyperparameter values that maximize the expected improvement
    • Using training results to update our probabilistic model

Advantages
  • Generally the most efficient hands-off way to choose hyperparameters

Disadvantages
  • Difficult to implement from scratch
  • Can be hard to integrate with off-the-shelf tools

To learn more, see: https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f

  • 5. Tune
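A heavily simplified caricature of that loop in pure Python: a crude surrogate (nearest observed loss minus an exploration bonus) stands in for the probabilistic model, and a toy objective stands in for training. Real implementations use Gaussian processes or tree-structured estimators; this only illustrates the propose-train-update alternation.

```python
import random

random.seed(0)

def true_loss(log_lr):
    # Stand-in for "train a model with this learning rate" (illustrative only).
    return (log_lr + 2) ** 2

# "Prior" evaluations at the edges of the search range:
observations = [(-5.0, true_loss(-5.0)), (0.0, true_loss(0.0))]

def acquisition(x):
    # Crude surrogate: predicted loss = nearest observed loss, minus an
    # exploration bonus that grows with distance from the observations.
    nx, ny = min(observations, key=lambda o: abs(o[0] - x))
    return ny - 2.0 * abs(x - nx)  # lower = more promising

for _ in range(15):
    # Propose the candidate the surrogate finds most promising...
    candidates = [random.uniform(-6, 0) for _ in range(100)]
    x = min(candidates, key=acquisition)
    # ...train with it, then update the surrogate with the result.
    observations.append((x, true_loss(x)))

best_x, best_y = min(observations, key=lambda o: o[1])
print(best_x, best_y)
```

The structure (propose via an acquisition function, evaluate, update) is the same as in real Bayesian optimization; only the surrogate here is a toy.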
slide-137
SLIDE 137


Method 5: Bayesian hyperparam opt

137

More on tools to do this automatically in the infrastructure & tooling lecture!

  • 5. Tune
slide-138
SLIDE 138


Summary of how to optimize hyperparams

138

  • Use coarse-to-fine random searches
  • Consider Bayesian hyper-parameter optimization solutions as your codebase matures

  • 5. Tune
slide-139
SLIDE 139


Conclusion

139

slide-140
SLIDE 140


Conclusion

140

  • DL debugging is hard due to many competing sources of error
  • To train bug-free DL models, we treat building our model as an iterative process
  • The following steps can make the process easier and catch errors as early as possible

slide-141
SLIDE 141


How to build bug-free DL models

141

Overview: Start simple → Implement & debug → Evaluate → Improve model/data → Tune hyperparams

  • Start simple: choose the simplest model & data possible (e.g., LeNet on a subset of your data)
  • Implement & debug: once model runs, overfit a single batch & reproduce a known result
  • Evaluate: apply the bias-variance decomposition to decide what to do next
  • Improve model/data: make your model bigger if you underfit; add data or regularize if you overfit
  • Tune hyperparams: use coarse-to-fine random searches

slide-142
SLIDE 142


Where to go to learn more

142

  • Andrew Ng’s book Machine Learning Yearning (http://www.mlyearning.org/)
  • The following Twitter thread: https://twitter.com/karpathy/status/1013244313327681536
  • This blog post: https://pcc.cs.byu.edu/2017/10/02/practical-advice-for-building-deep-neural-networks/