SLIDE 1

Towards General-Purpose Neural Network Computing

Schuyler Eldridge1 Amos Waterland2 Margo Seltzer2 Jonathan Appavoo3 Ajay Joshi1

1Boston University Department of Electrical and Computer Engineering 2Harvard University School of Engineering and Applied Sciences 3Boston University Department of Computer Science

24th International Conference on Parallel Architectures and Compilation Techniques

PACT ’15 1/23

SLIDE 2

Why Do We Care About Neural Networks?

“Good” solutions for hard problems

Capable of learning

Neural networks, again?

The neural network hype cycle has been a bumpy ride. Modern, resurgent interest in neural networks is driven by:

- Big, real-world data sets
- "Free" availability of transistors
- Use of accelerators
- The need for continued performance improvements

[Figure: a feedforward neural network with input layer, hidden layers, bias nodes, and output layer]

SLIDE 3

Neural Network Computing is Hot (Again)

Existing approaches

- Dedicated neural network/vector processors from the 1990s [1]
- Ongoing NPU work for approximate computing [2, 3, 4]
- High-performance deep neural network architectures [5, 6]

Neural networks as primitives

We treat neural networks as an application primitive

[1] J. Wawrzynek et al., "Spert-II: A vector microprocessor system," Computer, vol. 29, no. 3, pp. 79–86, Mar. 1996.
[2] H. Esmaeilzadeh et al., "Neural acceleration for general-purpose approximate programs," in MICRO, 2012.
[3] R. St. Amant et al., "General-purpose code acceleration with limited-precision analog computation," in ISCA, 2014.
[4] T. Moreau et al., "SNNAP: Approximate computing on programmable SoCs via neural acceleration," in HPCA, 2015.
[5] T. Chen et al., "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in ASPLOS, 2014.
[6] Z. Du et al., "ShiDianNao: Shifting vision processing closer to the sensor," in ISCA, 2015.

SLIDE 4

Our Vision of the Future of Neural Network Computing

[Figure: Processes 1–N, each backed by its own neural network (input layer, hidden layers, output layer), sitting above a user/supervisor interface, an operating system, and a multicontext/threaded NN accelerator; target applications include approximate computing [1], automatic parallelization [2], and machine learning]

[1] H. Esmaeilzadeh et al., "Neural acceleration for general-purpose approximate programs," in MICRO, 2012.
[2] A. Waterland et al., "ASC: Automatically scalable computation," in ASPLOS, 2014.

SLIDE 5

Our Contributions Towards this Vision

X-FILES: Hardware/Software Extensions

- Extensions for the Integration of Machine Learning in Everyday Systems
- A defined user and supervisor interface for neural networks
- This includes supervisor architectural state (hardware)

DANA: A Possible Multi-Transaction Accelerator

- Dynamically Allocated Neural Network Accelerator
- An accelerator aligning with our multi-transaction vision

I apologize for the names

- There is no association with files or filesystems
- X-FILES is plural (like "extensions")

SLIDE 6

An Overview of X-FILES/DANA Hardware

[Figure: system overview — cores 1–N (each with an L1 data cache) share an L2 cache and communicate with the X-FILES Arbiter, which holds a Transaction Queue, a Transaction Table (ASID, TID, NNID, state), and supervisor registers (ASID-NNID Table Pointer, Num ASIDs) backed by an ASID-NNID Table Walker; behind the arbiter, DANA contains a Register File, a PE Table (PE-1 through PE-N), an NN Configuration Cache with per-entry memories, and control logic]

Components

- General-purpose cores
- Transaction storage
- A backend accelerator that "executes" transactions
- Supervisor resources for memory safety
- Dedicated memory interface

SLIDE 7

At the User Level We Deal With “Transactions”

Neural Network Transactions

A transaction encapsulates a request by a process to compute the output of a specific neural network for a provided input

User Transaction API:

- newWriteRequest
- writeData
- readDataPoll

Identifiers

- NNID: Neural Network ID
- TID: Transaction ID
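The three user-level calls can be mocked in software to show the intended transaction lifecycle. Only the three call names come from this slide; the signatures, the TID bookkeeping, and the toy stand-in for the accelerator below are illustrative, not the actual X-FILES API:

```python
# Minimal mock of the X-FILES user transaction lifecycle.
# Only the three call names come from the slide; everything else
# (signatures, state handling, the toy "network") is illustrative.

_transactions = {}
_next_tid = 0

def new_write_request(nnid):
    """Start a transaction against network `nnid`; returns a TID."""
    global _next_tid
    tid = _next_tid
    _next_tid += 1
    _transactions[tid] = {"nnid": nnid, "inputs": [], "outputs": None}
    return tid

def write_data(tid, values):
    """Stream input elements into an open transaction."""
    _transactions[tid]["inputs"].extend(values)

def read_data_poll(tid):
    """Poll for outputs; here the result is computed on first poll."""
    t = _transactions[tid]
    if t["outputs"] is None:
        # Stand-in for DANA: this toy "network" just sums its inputs.
        t["outputs"] = [sum(t["inputs"])]
    return t["outputs"]

# Usage: open a request, write inputs, poll for outputs.
tid = new_write_request(nnid=3)
write_data(tid, [1.0, 2.0, 3.0])
print(read_data_poll(tid))  # [6.0]
```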

[Figure: a core issuing transaction requests to the X-FILES hardware arbiter]

Core/Accelerator Interface

We use the RoCC interface of the Rocket RISC-V microprocessor [1, 2]

[1] A. Waterman et al., "The RISC-V instruction set manual, volume I: User-level ISA, version 2.0," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2014-54, May 2014.
[2] A. Waterman et al., "The RISC-V instruction set manual, volume II: Privileged architecture, version 1.7," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2015-49, May 2015.

SLIDE 8

At the Supervisor Level We Deal With Address Spaces

Use cases:

- Single transaction
- Multiple transactions
- Sharing of networks
- Multiple networks

[Figure: Processes 1–N, each with its own neural network, above a user/supervisor interface, an operating system, and a multicontext/threaded NN accelerator]

Application

- We maintain the state of executing transactions
- We group transactions into Address Spaces
- Address Spaces are identified by an OS-managed ASID
- Each ASID defines the set of accessible networks
- Networks can be shared transparently if the OS allows this

SLIDE 9

An ASID–NNID Table Enables NNID Dereferencing

[Figure: the ASID-NNID Table Pointer and Num ASIDs registers select a per-ASID entry holding *ASID-NNID, *IO Queue, and Num NNIDs; each NNID indexes a *NN Configuration (header, layers, neurons, weights), and the IO queue points to input/output ring buffers with a status/header word]

ASID–NNID Table

- The OS establishes and maintains the ASID–NNID Table
- We assign ASIDs and NNIDs sequentially
- The ASID–NNID Table contains an optional asynchronous memory interface
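The dereferencing step can be sketched as a toy software model. The two-level structure and the Num-NNIDs bounds check follow the slide; the dict/list representation and the error handling are assumptions:

```python
# Toy model of ASID-NNID Table dereferencing. The two-level lookup
# mirrors the slide; the Python data structures are illustrative.

asid_nnid_table = {
    # ASID -> list of NN configurations (stand-ins for *NN Configuration)
    0: ["config-A", "config-B"],   # this address space can reach 2 networks
    1: ["config-C"],               # this one can reach 1
}

def dereference(asid, nnid):
    """Resolve (ASID, NNID) to a network configuration, so a process
    can only reach networks registered in its own address space."""
    configs = asid_nnid_table.get(asid)
    if configs is None:
        raise KeyError(f"invalid ASID {asid}")
    if not (0 <= nnid < len(configs)):   # Num NNIDs bounds check
        raise IndexError(f"NNID {nnid} out of range for ASID {asid}")
    return configs[nnid]

print(dereference(0, 1))  # config-B
```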

SLIDE 10

A Compact Binary Neural Network Configuration

[Figure: binary configuration layout — an info block (binaryPoint, totalEdges, totalNeurons, totalLayers, weightsPtr), a layers section (per-layer neuron pointer, neuronsInLayer, neuronsInNextLayer), a neurons section (per-neuron weight pointer, numberOfWeights, activationFunction, steepness, bias), and a flat weights section]

We condense the normal FANN neural network data structure

We use a reduced configuration from the Fast Artificial Neural Network (FANN) library [1] containing:

- Global information
- Per-layer information
- Per-neuron information
- Per-neuron weights
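A rough sketch of how such a condensed configuration might be flattened into bytes. The field names echo the slide, but the field widths, ordering, and the omission of the pointer and per-neuron sections are simplifications, not the real DANA layout:

```python
import struct

# Illustrative flat packing of a condensed FANN-style configuration:
# a global info block, then per-layer records, then fixed-point weights.
# Field names mirror the slide; widths and ordering are assumed.

def pack_config(binary_point, layers, weights):
    """layers: list of (neurons_in_layer, neurons_in_next_layer);
    weights: flat list of fixed-point weights (as ints)."""
    total_neurons = sum(n for n, _ in layers)
    # binaryPoint, totalEdges, totalNeurons, totalLayers
    header = struct.pack(
        "<4I", binary_point, len(weights), total_neurons, len(layers)
    )
    layer_blob = b"".join(struct.pack("<2I", n, m) for n, m in layers)
    weight_blob = b"".join(struct.pack("<i", w) for w in weights)
    return header + layer_blob + weight_blob

blob = pack_config(binary_point=7,
                   layers=[(2, 2), (2, 1)],
                   weights=[64, -32, 128, 16, 8, -8])
print(len(blob))  # 56 bytes: 16 header + 16 layers + 24 weights
```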

[1] S. Nissen, "Implementation of a fast artificial neural network library (FANN)," Department of Computer Science, University of Copenhagen (DIKU), Tech. Rep., 2003.

SLIDE 11

DANA: An Example Multi-Transaction Accelerator

[Figure: DANA internals — a Transaction Table with per-transaction entries (NN transaction, IO memory), an NN Configuration Cache with per-transaction cache memories, control logic, a Register File, and a PE Table (PE-1 through PE-N), coordinated with the X-FILES Arbiter]

Components

- Control logic determines actions given transaction state
- Network configurations are stored in a Configuration Cache
- Per-transaction IO Memory stores inputs and outputs
- A Register File stores intermediate outputs
- Logical neurons are mapped to Processing Elements
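The last point, mapping logical neurons onto a fixed pool of PEs, can be illustrated with a toy scheduler. DANA allocates PEs dynamically per transaction; the round-robin policy below is only a stand-in to show the many-neurons-to-few-PEs mapping:

```python
# Illustrative mapping of one layer's logical neurons onto a fixed pool
# of Processing Elements. DANA allocates PEs dynamically; this simple
# round-robin policy is a stand-in, not DANA's actual allocator.

def assign_neurons_to_pes(num_neurons, num_pes):
    """Return {pe_index: [logical neuron indices]} for one layer."""
    schedule = {pe: [] for pe in range(num_pes)}
    for neuron in range(num_neurons):
        schedule[neuron % num_pes].append(neuron)
    return schedule

# A 16-neuron hidden layer on 4 PEs: each PE computes 4 neurons in turn.
print(assign_neurons_to_pes(16, 4)[0])  # [0, 4, 8, 12]
```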

SLIDE 12

DANA: Single Transaction Execution

[Figure: a single transaction's network (bias, input layer, hidden layer, output layer) executing on PE1–PE4, with its configuration in Cache Memory-1, intermediate values in the Register File, and inputs/outputs in per-transaction IO memory]

SLIDE 13

DANA: Multi-Transaction Execution

[Figure: two concurrent transactions (TID-1 and TID-2), each with its own ASID/NNID, cache memory, inputs (I-1, I-2), and register entries (R-1 through R-3), interleaved across PE1–PE4]

SLIDE 14

We Organize All Data in Blocks of Elements

[Figure: a block of 4 elements (elements 1–4) versus a block of 8 elements (elements 1–8)]

Blocks for temporal locality

- We exploit neural network temporal locality of data
- Here, data refers to inputs or weights
- Larger block widths reduce inter-module communication
- Block width is an architectural parameter
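A minimal sketch of the blocking scheme. The block widths (4 or 8) come from the slide; the zero-padding of a final partial block is an assumption, not something the slide specifies:

```python
# Grouping a flat weight/input stream into fixed-width blocks, as on
# the slide (block width 4 or 8 elements). Wider blocks move more data
# per transfer, cutting inter-module round trips.

def to_blocks(elements, block_width):
    """Split `elements` into blocks of `block_width`, zero-padding the
    last block so every transfer is a full block (padding is assumed)."""
    blocks = []
    for i in range(0, len(elements), block_width):
        block = elements[i:i + block_width]
        block += [0] * (block_width - len(block))  # pad final block
        blocks.append(block)
    return blocks

weights = list(range(1, 10))          # 9 weights
print(len(to_blocks(weights, 4)))     # 3 blocks of 4
print(len(to_blocks(weights, 8)))     # 2 blocks of 8
```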

SLIDE 15

Evaluation Networks

Area                             Application    Configuration     Size
ASC [1]                          3sum           85 × 16 × 85      large
                                 collatz        40 × 16 × 40      large
                                 ll             144 × 16 × 144    large
                                 rsa            30 × 30 × 30      large
Approximate Computing [2, 3, 4]  blackscholes   6 × 8 × 8 × 1     small
                                 fft            1 × 4 × 4 × 2     small
                                 inversek2j     2 × 8 × 2         small
                                 jmeint         18 × 16 × 2       medium
                                 jpeg           64 × 16 × 64      large
                                 kmeans         6 × 16 × 16 × 1   medium
                                 sobel          9 × 8 × 1         small
Physics [5]                      edip           192 × 16 × 1      large

[1] A. Waterland et al., "ASC: Automatically scalable computation," in ASPLOS, 2014.
[2] H. Esmaeilzadeh et al., "Neural acceleration for general-purpose approximate programs," in MICRO, 2012.
[3] R. St. Amant et al., "General-purpose code acceleration with limited-precision analog computation," in ISCA, 2014.
[4] T. Moreau et al., "SNNAP: Approximate computing on programmable SoCs via neural acceleration," in HPCA, 2015.
[5] J. F. Justo et al., "Interatomic potential for silicon defects and disordered phases," Physical Review B, vol. 58, pp. 2539–2550, Aug. 1998.

SLIDE 16

Evaluation Methodology

Implementation

- X-FILES Arbiter and DANA implemented in SystemVerilog
- Free parameters include:
  - Elements per block
  - The number of Processing Elements
  - Internal table widths and storage sizes

Evaluation

- We compute average power with Cadence SoC Encounter in a 45 nm GlobalFoundries process
- We compute operating frequency using Cadence SoC Encounter
- We compute performance by running SystemVerilog testbenches at the computed operating frequency

SLIDE 17

Power and Performance

[Figure: average power (mW, stacked by component: Processing Elements, Cache, Register File, Transaction Table, Control Logic) and processing time (10^3–10^5 ns, per benchmark) versus number of Processing Elements, shown for 4 and 8 elements per block]

SLIDE 18

Single Transaction Throughput

[Figure: single-transaction throughput (edges per cycle) versus number of Processing Elements, for 4 and 8 elements per block, for inversek2j, fft, sobel, blackscholes, jmeint, kmeans, collatz, rsa, jpeg, edip, 3sum, and ll]

SLIDE 19

Multi-Transaction Throughput

[Figure: multi-transaction throughput (edges per cycle) versus number of Processing Elements, for 4 and 8 elements per block, for the pairs fft-fft, kmeans-fft, kmeans-kmeans, edip-fft, edip-kmeans, and edip-edip]

SLIDE 20

Multi-Transaction Speedup

[Figure: multi-transaction throughput speedup (−20% to +20%) versus number of Processing Elements (1–11), for 4 and 8 elements per block, for the same six transaction pairs]

SLIDE 21

Software Comparison

NN        Energy   Delay   EDP
3sum      7×       95×     664×
collatz   8×       106×    826×
ll        6×       88×     569×
rsa       6×       88×     566×
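Since the energy-delay product (EDP) is energy times delay, the EDP column should be roughly the product of the other two columns; the reported factors are rounded, so the products land near, not exactly on, the table's values. A quick check:

```python
# Sanity check of the EDP column: EDP = energy x delay, so the
# improvement factors should multiply. The table's factors are rounded,
# so products land near (not exactly on) the reported EDP factors.

rows = {"3sum": (7, 95, 664), "collatz": (8, 106, 826),
        "ll": (6, 88, 569), "rsa": (6, 88, 566)}

for name, (energy, delay, edp) in rows.items():
    print(name, energy * delay, "vs reported", edp)
```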

Methodology and comments

Comparison against a single core of an Intel SCC

Performance and power computed using gem5 [1] and McPAT [2]

[1] N. Binkert et al., "The gem5 simulator," SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.
[2] S. Li et al., "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in MICRO, 2009.

SLIDE 22

Summary and Acknowledgments

[Figure: X-FILES Arbiter and DANA system overview, repeated from Slide 6]

This work was supported by the following:

- A NASA Space Technology Research Fellowship
- An NSF Graduate Research Fellowship
- NSF CAREER awards
- A Google Faculty Research Award
