SLIDE 1

Exploiting Hidden Layer Modular Redundancy for Fault-Tolerance in Neural Network Accelerators

Schuyler Eldridge and Ajay Joshi

Department of Electrical and Computer Engineering, Boston University
schuye@bu.edu

January 30, 2015

This work was supported by a NASA Office of the Chief Technologist’s Space Technology Research Fellowship.

SLIDE 2

Motivation

Leveraging CMOS Scaling for Improved Performance is Becoming Increasingly Hard
Contributing factors include:

- Fixed power budgets
- An eventual slowdown of Moore’s Law

Alternative Designs
As an alternative, computer engineers are increasingly investigating general- and special-purpose accelerators. One actively researched accelerator architecture is the neural network accelerator.


SLIDE 3

Artificial Neural Networks

Figure: Two-layer neural network with i × h × o nodes (inputs I1..Ii with values X1..Xi, hidden neurons H1..Hh, outputs O1..Oo with values Y1..Yo, and a bias node feeding each layer).

Artificial Neural Network
- Directed graph of neurons
- Edges between neurons are weighted (see the sketch below)

Use in Applications
- Machine Learning
- Big Data
- Approximate Computing
- State Prediction
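As a concrete illustration (a minimal sketch, not from the slides; the tanh hidden activation and linear output layer are assumptions), the forward pass of the two-layer i × h × o network in the figure:

```python
import numpy as np

def forward(x, W_ih, b_h, W_ho, b_o):
    """Forward pass of a two-layer i x h x o network.

    x    : input vector X1..Xi, shape (i,)
    W_ih : input-to-hidden edge weights, shape (h, i)
    b_h  : hidden-layer bias weights, shape (h,)
    W_ho : hidden-to-output edge weights, shape (o, h)
    b_o  : output-layer bias weights, shape (o,)
    """
    h = np.tanh(W_ih @ x + b_h)  # hidden activations H1..Hh (tanh assumed)
    y = W_ho @ h + b_o           # outputs Y1..Yo (linear output assumed)
    return y
```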


SLIDE 4

Neural Networks and Fault-Tolerance

The Brain is Fault-Tolerant!
Ergo, neural networks are fault-tolerant? This isn’t generally the case!

Do Neural Networks Have the Potential for Fault-Tolerance?
Neural networks have a redundant structure:

- There are multiple paths from input to output
- Regression tasks often approximate smooth functions
- Small changes in inputs or internal computations may cause only small changes in the output

However, there is no implicit guarantee of fault-tolerance unless you train a neural network to specifically demonstrate those properties.


SLIDE 5

N-MR Technique

Figure: N-MR-1 (the baseline 2 × 2 × 2 network: inputs I1, I2; hidden neurons H1, H2; outputs O1, O2; bias nodes feeding each layer).

Steps for Amount of Redundancy N
1. Replicate each hidden neuron N times
2. Replicate each hidden neuron connection for each new neuron
3. Multiply all connection weights by 1/N
(a sketch of this transformation follows below)
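A minimal sketch of the transformation under one reading of step 3 (scaling only the outgoing, hidden-to-output weights of each replica by 1/N, which keeps the fault-free output identical to the original network); the function name and NumPy layout are illustrative, not from the accelerator:

```python
import numpy as np

def apply_nmr(W_ih, b_h, W_ho, n):
    """Apply N-MR-n to the hidden layer of a two-layer network.

    Steps 1-2: each hidden neuron and its connections are replicated
    n times. Step 3: each replica's outgoing weights are scaled by
    1/n, so the sum over a replica group reproduces the original
    neuron's contribution exactly when no fault occurs.
    """
    W_ih_n = np.repeat(W_ih, n, axis=0)      # replicate incoming edges
    b_h_n  = np.repeat(b_h, n)               # replicate hidden biases
    W_ho_n = np.repeat(W_ho, n, axis=1) / n  # replicate + scale outgoing edges
    return W_ih_n, b_h_n, W_ho_n
```

The payoff: if a fault perturbs one replica's activation by some amount, the output sees only 1/N of that perturbation through the scaled edge, so the fault's effect is attenuated N-fold.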


SLIDE 6

N-MR Technique

Figure: N-MR-1 (baseline network with hidden neurons H1, H2)

Figure: N-MR-2 (each hidden neuron replicated twice: H1..H4)


SLIDE 7

N-MR Technique

Figure: N-MR-1 (baseline network with hidden neurons H1, H2)

Figure: N-MR-3 (each hidden neuron replicated three times: H1..H6)


SLIDE 8

N-MR Technique

Figure: N-MR-1 (baseline network with hidden neurons H1, H2)

Figure: N-MR-4 (each hidden neuron replicated four times: H1..H8)


SLIDE 9

Neural Network Accelerator Architecture

Figure: Block diagram of our neural network accelerator (four PEs, intermediate storage, a control unit, NN config and data storage, and core communication)

Basic Operation in a Multicore Environment
- Threads communicate neural network computation requests to this accelerator
- The accelerator allocates processing elements (PEs) to compute the outputs of all pending requests (see the dispatch sketch below)
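A hypothetical sketch of the allocation step the slide describes; the names (dispatch, free_pes) and the first-come pairing policy are my assumptions, not details of the accelerator:

```python
from collections import deque

def dispatch(pending: deque, free_pes: list) -> list:
    """Pair pending neuron-computation requests with free PEs.

    Illustrative only: the slides do not specify the scheduling
    policy; this simply hands each queued request to the next
    available PE until one side runs out.
    """
    assignments = []
    while pending and free_pes:
        assignments.append((free_pes.pop(), pending.popleft()))
    return assignments
```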



SLIDE 14

Evaluation Overview

Table: Evaluated neural networks and their topologies

Application           NN Topology    Description
blackscholes (b) [1]  6 × 8 × 8 × 1  Financial option pricing
rsa (r) [2]           30 × 30 × 30   Brute-force prime factorization
sobel (s) [1]         9 × 8 × 1      3 × 3 Sobel filter

Methodology
- We vary the amount of N-MR for the applications in Table 1 running on our NN accelerator architecture
- We introduce a random fault into a neuron and measure the accuracy and latency (a fault-injection sketch follows below)
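One way to picture the fault injection (a sketch under assumptions: the slides say only "a random fault into a neuron", so the uniform-overwrite fault model below is mine, not the paper's):

```python
import numpy as np

def inject_fault(h, neuron, rng):
    """Corrupt one hidden activation with a random value.

    h      : vector of hidden activations (after N-MR replication)
    neuron : index of the faulty (replicated) hidden neuron
    rng    : np.random.Generator, e.g. np.random.default_rng(0)
    """
    h = h.copy()
    h[neuron] = rng.uniform(-1.0, 1.0)  # assumed fault model
    return h
```

The faulty forward pass is then scored with each application's accuracy metric (MSE or % correct, per the following slides).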

[1] R. St. Amant et al., “General-purpose code acceleration with limited-precision analog computation,” in ISCA, 2014, pp. 505–516.
[2] A. Waterland et al., “ASC: Automatically scalable computation,” in ASPLOS, 2014, pp. 575–590.


SLIDE 15

Evaluation – Normalized Latency

Figure: Latency normalized to N-MR-1 (x-axis: amount of N-MR; series: blackscholes, rsa, sobel, and a linear baseline)

Latency Scaling with N-MR
- Work, where work is the number of edges to compute, scales with N-MR (see the count below)
- However, latency scales sublinearly for our accelerator
- Increasing N-MR means more work, but also more efficient use of the accelerator
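To make the work scaling concrete (my arithmetic, counting only the non-bias edges of a two-layer i × h × o network):

```latex
\mathrm{edges}(N) = i\,(Nh) + (Nh)\,o = N \cdot h\,(i+o)
```

so the work grows linearly in the amount of N-MR, while the measured latency grows more slowly.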


SLIDE 16

Evaluation – Accuracy

Figure: Left: percentage error increase vs. amount of N-MR, log scale (blackscholes (MSE), rsa (% correct), sobel (MSE)); Right: accuracy normalized to N-MR-1

Accuracy and N-MR
Generally, accuracy improves with increasing N-MR


SLIDE 17

Evaluation – Combined Metrics

Figure: Energy-Delay Product (EDP), normalized, for varying amounts of N-MR (blackscholes, rsa, sobel)

Cost of N-MR
- We evaluate the cost using the Energy-Delay Product (EDP); see the definition below
- The cost is high, as N-MR increases both energy and delay
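For reference, the standard metric (the normalization to N-MR-1 is my reading of the figure, matching the latency plot):

```latex
\mathrm{EDP} = E \times D, \qquad
\mathrm{EDP}_{\mathrm{norm}}(N) = \frac{E_N \, D_N}{E_1 \, D_1}
```

Since N-MR-N raises both the energy and the delay, their product grows faster than either term alone.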


SLIDE 18

Discussion and Conclusion

An Initial Approach
- As neural network accelerators become mainstream, approaches to improve their fault-tolerance will have increased value
- N-MR is a preliminary step to leverage the potential for fault-tolerance in neural networks
- Other approaches do exist:
  - Training with faults
  - Splitting important neurons and pruning unimportant ones

Future Directions
- Varying N-MR at run time
- Faults are currently assumed to be intermittent; by varying internal PE structure and enforcing that replicated neurons are scheduled on different PEs, a more robust approach can be developed
- Run-time splitting of important nodes, or skipping the computation of unimportant ones


SLIDE 19

Summary and Questions

Figure: Latency, accuracy, and combined metrics

Figure: A two-layer NN

Figure: NN accelerator architecture
