

SLIDE 1

CRADLE: Cross-Backend Validation to Detect and Localize Bugs in Deep Learning Libraries

Hung Viet Pham1  Thibaud Lutellier1  Weizhen Qi2  Lin Tan3

1University of Waterloo, Canada  2University of Science and Technology of China, China  3Purdue University, USA

SLIDE 2

Deep learning (DL) is pervasive

Machine translation · Alzheimer’s disease diagnosis · Autonomous cars · Virtual assistants

SLIDE 3

DL system

Correct DL systems require correct implementations


[Diagram: a DL system consists of algorithms/models and their implementations; CRADLE targets the implementations]

SLIDE 4

DL libraries are hard to test and debug

  • Intrinsic complexity
  • The expected output of a DL system is unknown

    ○ A correct program should produce the expected output.
    ○ The ground truth is not the expected output, because models are not perfect.


[Example: ground truth: banana; MobileNetV2 expected output: tennis ball; MobileNetV2 on TensorFlow: banana — the ground truth is not the implementation’s expected output]

SLIDE 5

Idea: Differential testing

[Diagram: the same InceptionResNetV2 model runs on the TensorFlow backend and the CNTK backend; for a “petri-dish” image the two classifications differ — an inconsistency]

SLIDE 6
Batch_normalization bug

  • The CNTK batch normalization formula was implemented incorrectly.
  • The developers fixed the bug after we reported it.


- return (x - mean) / (C.sqrt(var) + epsilon) * gamma + beta
+ return (x - mean) / C.sqrt(var + epsilon) * gamma + beta
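The one-character fix moves epsilon inside the square root, which matters most when the variance is tiny. A quick NumPy sketch (function names and test values are mine, not CNTK’s code) shows how large the resulting disagreement gets — exactly the kind of deviation cross-backend validation surfaces:

```python
import numpy as np

def bn_buggy(x, mean, var, gamma, beta, epsilon=1e-3):
    # Old CNTK behavior: epsilon added AFTER the square root
    return (x - mean) / (np.sqrt(var) + epsilon) * gamma + beta

def bn_fixed(x, mean, var, gamma, beta, epsilon=1e-3):
    # The fix: epsilon added to the variance BEFORE the square root
    return (x - mean) / np.sqrt(var + epsilon) * gamma + beta

# With a tiny variance the two formulas disagree by orders of magnitude,
# so comparing backends on the same trained model flags the layer.
x = np.array([1.0, 2.0, 3.0])
diff = np.abs(bn_buggy(x, 0.0, 1e-6, 1.0, 0.0) - bn_fixed(x, 0.0, 1e-6, 1.0, 0.0))
print(diff.max())
```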

SLIDE 7

Differential testing: Challenges

  • How to compare two implementations?

    ○ What metric to use?
    ○ What should be considered bugs?

  • How to localize the faults?

    ○ How to find faults in the complex model executions?

SLIDE 8

Differential testing: Ideas

  • Two metrics measure the severity of an inconsistency over a set of input instances.
  • A localization map compares the intermediate states of DL models for fault localization.

SLIDE 9

CRADLE: Overview


Detection phase: trained models & validation data → output extractor → model outputs (and crash bugs) → output comparator → inconsistency bugs & unique inconsistencies

Localization phase: unique inconsistencies → hidden states extractor → hidden states → inconsistency localizer → localization maps

SLIDE 10

CRADLE: Detection phase


Detection phase: trained models & validation data → output extractor → model outputs (and crash bugs) → output comparator → inconsistency bugs & unique inconsistencies

SLIDE 11

Output extractor


  • Executes the models on different backends to obtain outputs
  • Detects crashes

[Diagram: the InceptionResNetV2 model is executed on the CNTK backend to obtain its classification of a “petri-dish” image]

SLIDE 12

Output comparator: Distance metrics

  • MAD-based (regression)
  • CLASS-based (classification)

Both metrics calculate the difference relative to the ground truth.
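As a rough sketch of the MAD-based idea for regression outputs (the paper’s exact normalization may differ; `mad` and `d_mad` are illustrative names):

```python
import numpy as np

def mad(y, ground_truth):
    # Mean absolute deviation of one backend's output from the ground truth.
    return float(np.mean(np.abs(y - ground_truth)))

def d_mad(y1, y2, ground_truth):
    # Compare two backends through their deviation from the ground truth
    # rather than directly against each other.
    return abs(mad(y1, ground_truth) - mad(y2, ground_truth))

g  = np.array([1.0, 2.0, 3.0])     # ground truth for a regression output
y1 = np.array([1.1, 2.1, 3.1])     # backend A: small uniform error
y2 = np.array([1.5, 2.5, 3.5])     # backend B: larger uniform error
print(d_mad(y1, y2, g))            # ~0.4
```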

SLIDE 13

CLASS-based distance example

Top-5 classification:

TensorFlow: Rankpetri-dish,TF = 1, so σpetri-dish,TF = 2^(5−1) = 16
CNTK: Rankpetri-dish,CN > 5, so σpetri-dish,CN = 0

|σpetri-dish,TF − σpetri-dish,CN| = 16
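One plausible reading of the score used above is σ = 2^(k − rank) when the ground-truth class appears in the top-k, and 0 otherwise; a sketch under that assumption (label lists other than “petri-dish” are invented):

```python
def class_score(predictions, ground_truth, k=5):
    # predictions: class labels sorted by confidence, most confident first.
    topk = predictions[:k]
    if ground_truth not in topk:
        return 0                       # ranked below top-k -> score 0
    rank = topk.index(ground_truth) + 1
    return 2 ** (k - rank)             # rank 1 -> 2^(k-1) = 16 for k = 5

def d_class(preds_a, preds_b, ground_truth, k=5):
    # Severity of the disagreement between two backends for one input.
    return abs(class_score(preds_a, ground_truth, k)
               - class_score(preds_b, ground_truth, k))

tf_preds   = ["petri-dish", "cup", "bowl", "plate", "pot"]   # rank 1
cntk_preds = ["ant", "bee", "fly", "wasp", "beetle"]         # not in top-5
print(d_class(tf_preds, cntk_preds, "petri-dish"))           # 16
```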

SLIDE 14

Inconsistency triggering input (ITI)

  • An input instance that triggers a distance larger than a threshold (TC for CLASS-based, TM for MAD-based)

    ○ E.g., the “petri-dish” image is an ITI given TC = 8.


[Example ITIs: Theano: Indian elephant vs. TensorFlow/CNTK: groom; TensorFlow: banana vs. CNTK/Theano: tennis ball; CNTK: Arabian camel vs. TensorFlow/Theano: hen]

SLIDE 15

Detect inconsistency

  • An inconsistency is a pair of implementations that triggers ITIs for more than p% of the validation set


[Example: InceptionResNetV2TensorFlow vs. InceptionResNetV2CNTK, compared with D_CLASS over the validation set; with TC = 8, a distance of 16 marks an ITI while 6 does not; the pair is inconsistent if more than p = 10% of inputs are ITIs]
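The detection rule can be sketched as a simple threshold check over per-input distances (helper name and sample values are illustrative):

```python
def is_inconsistent(distances, threshold=8.0, p=0.10):
    # distances: one CLASS-based (or MAD-based) distance per validation input.
    # An input is an ITI when its distance exceeds the threshold; the pair of
    # implementations is inconsistent when more than p of the inputs are ITIs.
    itis = sum(1 for d in distances if d > threshold)
    return itis / len(distances) > p

# 3 of 20 inputs (15%) exceed TC = 8, which is above p = 10%
distances = [16, 12, 9] + [0] * 17
print(is_inconsistent(distances))   # True
```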

SLIDE 16

CRADLE: Localization phase


Localization phase: unique inconsistencies → hidden states extractor → hidden states → inconsistency localizer → localization maps

SLIDE 17

Hidden state extractor

  • The “most inconsistent” input per inconsistency is used.
  • The network structure plus the hidden states are considered the network execution graph.
  • Hidden states are the outputs of hidden layers.


[InceptionResNetV2 execution graph on TensorFlow: Conv2D → BatchNorm → Activation → … (776 layers omitted) → GloAvgPool → Dense; input: jean; TensorFlow output: jean]

SLIDE 18

MAD differences


[The execution graphs on TensorFlow (output: jean) and CNTK (output: mail bag) for the same input (jean) are compared layer by layer, yielding a MAD difference 𝜀 per layer, e.g. Conv2D 𝜀 = 0.0, BatchNorm 𝜀 = 0.0002, GloAvgPool 𝜀 = 0.0860, Activation 𝜀 = 0.1480, Dense 𝜀 = 0.0004; 776 layers omitted]
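Per-layer MAD differences like the 𝜀 values on this slide can be sketched by comparing corresponding hidden states; the aggregation below (mean absolute difference between matching layer outputs) and the toy arrays are my assumptions:

```python
import numpy as np

def layer_epsilon(state_a, state_b):
    # Mean absolute difference between the two backends' hidden states
    # for one layer (standing in for the per-layer MAD difference).
    return float(np.mean(np.abs(state_a - state_b)))

def localization_epsilons(states_a, states_b):
    # One epsilon per layer, keyed by layer name.
    return {name: layer_epsilon(states_a[name], states_b[name])
            for name in states_a}

states_tf   = {"conv2d": np.zeros(4), "dense": np.array([0.1, 0.2])}
states_cntk = {"conv2d": np.zeros(4), "dense": np.array([0.1, 0.3])}
print(localization_epsilons(states_tf, states_cntk))
```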

SLIDE 19

Inconsistency introduction rate

  • Calculate the rate of change R of the inconsistency at each layer

    ○ A small constant ∊ prevents division by zero

  • Highlight executions with R above the third quartile

InceptionResNetV2 localization map between TensorFlow and CNTK

[Localization map excerpt, per-layer 𝜀 and R: BatchNorm 𝜀 = 0.0002, R = 2048.6; Conv2D 𝜀 = 0.0, R = 0.0; Conv2D 𝜀 = 0.0138, R = 0.5530; BatchNorm 𝜀 = 0.3067, R = 21.186; Activation 𝜀 < 0.0001, R = −0.5497; Conv2D 𝜀 = 0.0003, R = 2.3009; GloAvgPool 𝜀 = 0.0860, R = −0.4191; Activation 𝜀 = 0.1480, R = −0.5173; Dense 𝜀 = 0.0004, R = −0.9950; 772 layers omitted; input: jean; TensorFlow: jean; CNTK: mail bag]
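A sketch of the rate-of-change idea, assuming R relates each layer’s 𝜀 to the preceding layer’s (the paper’s exact definition may differ) and that layers with R above the third quartile are highlighted; the layer order in `eps` is illustrative:

```python
def rate_of_change(eps_in, eps_out, small=1e-7):
    # Relative growth of the inconsistency across one layer;
    # the small constant prevents division by zero when eps_in is 0.
    return (eps_out - eps_in) / (eps_in + small)

def highlight(rates):
    # Flag layers whose R lies above the third quartile (rough quartile:
    # the element at the 75% position of the sorted list).
    q3 = sorted(rates)[int(len(rates) * 0.75)]
    return [r for r in rates if r > q3]

# Epsilon values along consecutive layers (values taken from the slide,
# order assumed for illustration)
eps = [0.0002, 0.0, 0.0003, 0.0860, 0.1480, 0.0004]
rates = [rate_of_change(a, b) for a, b in zip(eps, eps[1:])]
print(highlight(rates))   # the jump right after the zero-epsilon layer stands out
```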

SLIDE 20


Result

  • 28 models, 11 datasets, 3 backends
  • 104 unique inconsistencies
  • 7 inconsistency bugs and 5 crash bugs

SLIDE 21


7 inconsistency bugs

  • Batch normalization → BatchNormalization
  • Padding scheme → Conv2D variants
  • Pooling scheme → AveragePooling2D
  • Parameter organization → Trainable Conv

SLIDE 22

Localization is helpful

  • The localization maps are relevant to the causes of all 104 unique inconsistencies

SLIDE 23

Conclusion

  • CRADLE applies differential testing to DL implementations and localizes faulty functions by tracking error propagation.

    ○ Detects 7 confirmed inconsistency bugs and 5 crash bugs
    ○ Helps find the root causes of all 104 unique inconsistencies using localization maps

  • Inconsistencies are common and widespread.
  • We call for more attention to the testing of DL libraries.

SLIDE 24

DL system overview


[Stack diagram: user code → Keras (high-level library, the interface) → low-level libraries / backends (TensorFlow, Theano, CNTK, …) → hardware (CPU, GPU)]

SLIDE 25

Group unique inconsistency

  • A group of inconsistencies with the same inconsistency pattern between the same pair of implementations

    ○ The inconsistency pattern is the distribution of metric distances

SLIDE 26

Suggested settings

  • Grid search over TC, TM, and p values
  • Optimal settings (detecting the most inconsistencies without false negatives or false positives):

    ○ CLASS-based: TC = 8 and p = 0%
    ○ MAD-based: TM = 0.2 and p = 0%

  • Confirmed using cross-validation

SLIDE 27

Dataset and hardware

  • Datasets:

    ○ 11 datasets including ImageNet, MNIST, the Udacity Driving Challenge 2, etc.
    ○ 30 pre-trained models

  • Hardware:

    ○ Intel Xeon E5-2695
    ○ NVIDIA Titan Xp

SLIDE 28

Detected inconsistencies

The numbers outside and inside parentheses are the unique and total numbers of inconsistencies, respectively.

SLIDE 29

Comparison to accuracy

  • Detect an inconsistency if the top-k accuracy difference is above a threshold TAC
  • We pick k between 1 and 5 and TAC between 0 and 50
  • With TAC = 0, top-1 accuracy detects the most inconsistencies (305) but still misses 35

    ○ E.g., for the dog-species model, the Batch_normalization bug induces an inconsistency between TensorFlow and CNTK
    ○ However, those backends have identical top-1 (29.9%) and top-5 (64.4%) accuracies
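The failure mode described above — identical accuracies masking real disagreement — is easy to reproduce with a toy example (labels invented):

```python
def top1_accuracy(preds, truths):
    # Fraction of inputs where the top-1 prediction matches the label.
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)

truths  = ["cat", "dog", "cat", "dog"]
preds_a = ["cat", "dog", "dog", "cat"]   # backend A: correct on inputs 1-2
preds_b = ["dog", "cat", "cat", "dog"]   # backend B: correct on inputs 3-4
print(top1_accuracy(preds_a, truths))    # 0.5
print(top1_accuracy(preds_b, truths))    # 0.5
# Identical accuracies, yet the backends disagree on every input --
# a per-input distance metric catches this, an accuracy comparison cannot.
print(sum(a != b for a, b in zip(preds_a, preds_b)))   # 4
```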

SLIDE 30

Future work

  • Detect inconsistencies and bugs in training code

    ○ Harder, since training is non-deterministic

  • Generate mutated models using fuzzing to expand the test set
  • Test with only one backend using equivalent models
